12 Dec 2012 06:34

## Kernel Loops in Accelerate

Hi Trevor (and cafe),

I've been playing more and more with accelerate, and I find it quite annoying that there are no loops. It makes implementing many algorithms much harder than it should be.

For example, I would love to submit a patch to fix issue #52 [0] on github by implementing MWC64X [1], but it's very hard to port the OpenCL code on that page when it's impossible to write kernel expressions with loops. Also, that means there are no high-level combinators I'm used to for my sequential code (such as map and fold) that would work on an accelerate CUDA kernel.

As a nice strawman example, how would one implement the following kernel in accelerate, assuming 'rand_next', 'rand_get', and 'rand_skip' can all be implemented cheaply? :

typedef uint64_t rand_state;

__device__ rand_state rand_next(rand_state s);
__device__ uint32_t rand_get(rand_state s);
__device__ rand_state rand_skip(rand_state s, uint64_t distance);
__device__ uint32_t round_to_next_pow2(uint32_t n);

// Fills an array with random numbers given a random seed,
// a maximum random number to generate, and an output
// array to put the result in. The output will be in the range
// [0, rand_max).
__kernel__ void fill_random(rand_state start_state, uint32_t rand_max, uint32_t* out) {
rand_state current_state = start_state;
int i = blockDim.x*blockIdx.x + threadIdx.x;
// assumes we skip less than 1 million times per element...
current_state = rand_skip(current_state, i*1e6);
uint32_t mask = round_to_next_pow2(rand_max) - 1;
uint32_t result;
do {
result = rand_get(current_state);
current_state = rand_next(current_state);
} while(result & mask >= rand_max);

out[i] = result;
} // note: code was neither debugged, run, nor compiled.

Thanks,
- Clark

[0] https://github.com/AccelerateHS/accelerate/issues/52
[1] http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
```_______________________________________________
```
13 Dec 2012 13:47

### Re: Kernel Loops in Accelerate

Hi Clark,

The question of sequential loops in Accelerate has come up a few times in the
past. The main sticking point is knowing how to implement them in a way that
encourages efficient programs; avoiding irregular arrays (iteration depths),
discouraging scalar versions of collective combinators, etc. Basically we need
to avoid thread divergence where possible, else the GPU SIMD hardware won't be
well utilised.

I've experimented a little with this. If you check the generated code of the
mandelbrot program [1], you'll see it has recovered the _fixed_ iteration depth
into a scalar loop.

I'll hack up something like a while loop for value iteration and see what
happens. Tentative proposal (perhaps just the second):

iterate :: Exp Int            -- fixed iteration count (maybe just a regular Int)
-> (Exp a -> Exp a)   -- function to repeatedly apply
-> Exp a              -- initial (seed) value
-> Exp a

while   :: (Exp a -> Exp Bool)
-> (Exp a -> Exp a)
-> Exp a
-> Exp a

It would be good if we could collect some additional concrete applications and
see if there are better ways to map certain classes of sequential iteration to
efficient GPU code.

Cheers,
-Trev

On 12/12/2012, at 4:34 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Hi Trevor (and cafe),

I've been playing more and more with accelerate, and I find it quite annoying that there are no loops. It makes implementing many algorithms much harder than it should be.

For example, I would love to submit a patch to fix issue #52 [0] on github by implementing MWC64X [1], but it's very hard to port the OpenCL code on that page when it's impossible to write kernel expressions with loops. Also, that means there are no high-level combinators I'm used to for my sequential code (such as map and fold) that would work on an accelerate CUDA kernel.

As a nice strawman example, how would one implement the following kernel in accelerate, assuming 'rand_next', 'rand_get', and 'rand_skip' can all be implemented cheaply? :

typedef uint64_t rand_state;

__device__ rand_state rand_next(rand_state s);
__device__ uint32_t rand_get(rand_state s);
__device__ rand_state rand_skip(rand_state s, uint64_t distance);
__device__ uint32_t round_to_next_pow2(uint32_t n);

// Fills an array with random numbers given a random seed,
// a maximum random number to generate, and an output
// array to put the result in. The output will be in the range
// [0, rand_max).
__kernel__ void fill_random(rand_state start_state, uint32_t rand_max, uint32_t* out) {
rand_state current_state = start_state;
int i = blockDim.x*blockIdx.x + threadIdx.x;
// assumes we skip less than 1 million times per element...
current_state = rand_skip(current_state, i*1e6);
uint32_t mask = round_to_next_pow2(rand_max) - 1;
uint32_t result;
do {
result = rand_get(current_state);
current_state = rand_next(current_state);
} while(result & mask >= rand_max);

out[i] = result;
} // note: code was neither debugged, run, nor compiled.

Thanks,
- Clark

[0] https://github.com/AccelerateHS/accelerate/issues/52
[1] http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html

```_______________________________________________
```
13 Dec 2012 14:13

### Re: Kernel Loops in Accelerate

Sometimes we need some divergence in kernels, such as the random number generation example I just posted. Technically (but not practically), we could have a thread executing forever.

It's fine to discourage writing these loops, and with the proposed signatures there still won't be any temptation to do data-parallel work, but they're necessary for many real-world tasks on which parallel algorithms don't entirely map.

Once this is all said and done, however, I think I'll be writing an initial implementation of a parallel Monte Carlo library based on accelerate. It (accelerate) seems really promising. I especially love how I can test my programs without an Nvidia GPU, and can switch to performance test and to run real-world code.

Thanks,
- Clark

P.S. Don't forget about do-while! =)

On Thu, Dec 13, 2012 at 7:47 AM, Trevor L. McDonell wrote:
Hi Clark,

The question of sequential loops in Accelerate has come up a few times in the
past. The main sticking point is knowing how to implement them in a way that
encourages efficient programs; avoiding irregular arrays (iteration depths),
discouraging scalar versions of collective combinators, etc. Basically we need
to avoid thread divergence where possible, else the GPU SIMD hardware won't be
well utilised.

I've experimented a little with this. If you check the generated code of the
mandelbrot program [1], you'll see it has recovered the _fixed_ iteration depth
into a scalar loop.

I'll hack up something like a while loop for value iteration and see what
happens. Tentative proposal (perhaps just the second):

iterate :: Exp Int            -- fixed iteration count (maybe just a regular Int)
-> (Exp a -> Exp a)   -- function to repeatedly apply
-> Exp a              -- initial (seed) value
-> Exp a

while   :: (Exp a -> Exp Bool)
-> (Exp a -> Exp a)
-> Exp a
-> Exp a

It would be good if we could collect some additional concrete applications and
see if there are better ways to map certain classes of sequential iteration to
efficient GPU code.

Cheers,
-Trev

On 12/12/2012, at 4:34 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Hi Trevor (and cafe),

I've been playing more and more with accelerate, and I find it quite annoying that there are no loops. It makes implementing many algorithms much harder than it should be.

For example, I would love to submit a patch to fix issue #52 [0] on github by implementing MWC64X [1], but it's very hard to port the OpenCL code on that page when it's impossible to write kernel expressions with loops. Also, that means there are no high-level combinators I'm used to for my sequential code (such as map and fold) that would work on an accelerate CUDA kernel.

As a nice strawman example, how would one implement the following kernel in accelerate, assuming 'rand_next', 'rand_get', and 'rand_skip' can all be implemented cheaply? :

typedef uint64_t rand_state;

__device__ rand_state rand_next(rand_state s);
__device__ uint32_t rand_get(rand_state s);
__device__ rand_state rand_skip(rand_state s, uint64_t distance);
__device__ uint32_t round_to_next_pow2(uint32_t n);

// Fills an array with random numbers given a random seed,
// a maximum random number to generate, and an output
// array to put the result in. The output will be in the range
// [0, rand_max).
__kernel__ void fill_random(rand_state start_state, uint32_t rand_max, uint32_t* out) {
rand_state current_state = start_state;
int i = blockDim.x*blockIdx.x + threadIdx.x;
// assumes we skip less than 1 million times per element...
current_state = rand_skip(current_state, i*1e6);
uint32_t mask = round_to_next_pow2(rand_max) - 1;
uint32_t result;
do {
result = rand_get(current_state);
current_state = rand_next(current_state);
} while(result & mask >= rand_max);

out[i] = result;
} // note: code was neither debugged, run, nor compiled.

Thanks,
- Clark

[0] https://github.com/AccelerateHS/accelerate/issues/52
[1] http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html

_______________________________________________
```_______________________________________________