12 Dec 06:34 2012

## Kernel Loops in Accelerate

Clark Gaebel <cgaebel <at> uwaterloo.ca>

2012-12-12 05:34:43 GMT


Hi Trevor (and cafe),

I've been playing more and more with Accelerate, and I find it quite annoying that there are no loops: it makes many algorithms much harder to implement than they should be.

For example, I would love to submit a patch to fix issue #52 [0] on GitHub by implementing MWC64X [1], but it's very hard to port the OpenCL code on that page when it's impossible to write kernel expressions with loops. It also means that none of the high-level combinators I'm used to from sequential code (such as map and fold) are available inside an Accelerate CUDA kernel.

As a nice strawman example, how would one implement the following kernel in Accelerate, assuming 'rand_next', 'rand_get', and 'rand_skip' can all be implemented cheaply?

```c
typedef uint64_t rand_state;

__device__ rand_state rand_next(rand_state s);
__device__ uint32_t   rand_get(rand_state s);
__device__ rand_state rand_skip(rand_state s, uint64_t distance);
__device__ uint32_t   round_to_next_pow2(uint32_t n);

// Fills an array with random numbers given a starting seed and a
// maximum random number to generate. Each output element lies in
// the range [0, rand_max).
__global__ void fill_random(rand_state start_state, uint32_t rand_max, uint32_t* out) {
    int i = blockDim.x*blockIdx.x + threadIdx.x;

    // Assumes we draw fewer than 1 million times per element...
    rand_state current_state = rand_skip(start_state, (uint64_t)i * 1000000ull);

    // Rejection sampling: mask down to one below the next power of
    // two, then retry until the masked draw falls below rand_max.
    uint32_t mask = round_to_next_pow2(rand_max) - 1;
    uint32_t result;
    do {
        result = rand_get(current_state) & mask;
        current_state = rand_next(current_state);
    } while (result >= rand_max);

    out[i] = result;
} // note: code was neither debugged, run, nor compiled.
```
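To make the shape of the problem concrete, here is a sketch of the same rejection-sampling loop in plain Haskell. To keep it self-contained I've used a toy 64-bit LCG as a stand-in for MWC64X (the constants and the `randNext`/`randGet` helpers are illustrative, not the real algorithm). It's exactly this kind of state-passing do/while loop that I can't express as an Accelerate scalar expression today:

```haskell
import Data.Bits ((.&.), shiftL, shiftR)
import Data.Word (Word32, Word64)

-- Stand-in RNG state; a real port would carry MWC64X's (x, c) pair.
type RandState = Word64

-- Toy LCG step standing in for rand_next (NOT the MWC64X transition).
randNext :: RandState -> RandState
randNext s = s * 6364136223846793005 + 1442695040888963407

-- Extract 32 output bits (the high half), standing in for rand_get.
randGet :: RandState -> Word32
randGet s = fromIntegral (s `shiftR` 32)

-- Smallest power of two >= n, for n > 0.
roundToNextPow2 :: Word32 -> Word32
roundToNextPow2 n = go 1
  where go p = if p >= n then p else go (p `shiftL` 1)

-- Rejection sampling: mask each draw down to the next power of two,
-- retry until it lands in [0, randMax). The recursive 'go' is the
-- data-dependent loop that has no Accelerate Exp-level equivalent.
uniformBelow :: Word32 -> RandState -> (Word32, RandState)
uniformBelow randMax = go
  where
    mask = roundToNextPow2 randMax - 1
    go s =
      let r  = randGet s .&. mask
          s' = randNext s
      in  if r >= randMax then go s' else (r, s')
```

Something like a hypothetical scalar combinator `while :: (Exp s -> Exp Bool) -> (Exp s -> Exp s) -> Exp s -> Exp s` (not in Accelerate as of this writing, as far as I know) would let the `go` loop above be written directly inside a kernel expression.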

Thanks,

- Clark

[0] https://github.com/AccelerateHS/accelerate/issues/52

[1] http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe