4 Sep 2013 13:53

## Efficient matrix multiply using accelerate

```
I've been trying to get some speed out of the accelerate library today.
What I want to implement is something as simple as a matrix multiply.
I'd like it to be fast and memory efficient.
Given the equation
C = AB

where
A is nxr
B is rxm
C is nxm

it seems reasonable to allocate three arrays on the GPU with n*r, r*m
and n*m elements respectively.

Anyone know how to achieve this with accelerate? My first thought was
to use the generate function to create the new C array, but I didn't
manage to wrap my head around all the fancy type features that pop up
when you want to return an array C that has dimensions dependent on
the dimensions of its inputs, A and B.

I've searched around a bit and found this [1] example implementation but
it is just as slow as a simple sequential algorithm in C. I would be
very thankful for any advice for working with accelerate!

Here's a snippet of what I have tried to make. There are several
errors in there. Maybe I'm approaching the problem from the wrong
angle.

matMul' arr brr =
let dotProd shp =
```

5 Sep 2013 03:47

### Re: Efficient matrix multiply using accelerate

```
Hi Morten,

On 04/09/2013, at 9:53 PM, Morten Olsen Lysgaard <morten <at> lysgaard.no> wrote:

> I've been trying to get some speed out of the accelerate library today.
> What I want to implement is something as simple as a matrix multiply.
> I'd like it to be fast and memory efficient.

Well, the trouble with something like matrix multiply is that it is only simple conceptually. Making it
fast and memory [bandwidth] efficient is another matter entirely, GPU or otherwise.

> Anyone know how to achieve this with accelerate? My first thought was
> to use the generate function to create the new C array, but I didn't
> manage to wrap my head around all the fancy type features that pop up
> when you want to return an array C that has dimensions dependent on
> the dimensions of its inputs, A and B.

If in doubt, add a type signature; sometimes you need to help GHC's type checker along a bit. Typically this
comes up when constructing/deconstructing things using lift/unlift respectively, which you'll need
to do to unpack shapes & indices.

> I've searched around a bit and found this [1] example implementation but
> it is just as slow as a simple sequential algorithm in C.

What kind of GPU are you running on?
Also I haven't benchmarked that matrix multiply code, but I am not overly surprised --- it wasn't designed
to be efficient.

> Here's a snippet of what I have tried to make. There are several
> errors in there. Maybe I'm approaching the problem from the wrong
```
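To make the advice about type signatures and lift/unlift concrete, here is a minimal sketch of a matrix multiply whose result shape depends on the shapes of its inputs. It is written against the accelerate 1.x API (the `Matrix` alias and the `A.Num` class), which postdates this thread, so treat those names as assumptions when using an older release. It follows the replicate/zipWith/fold formulation shown in the library's own documentation and is meant to show how the shapes are unpacked, not to be a fast implementation. The `main` function and its sample inputs are purely illustrative and use the reference interpreter backend.

```haskell
{-# LANGUAGE TypeOperators #-}

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Acc, Exp, Matrix, Z(..), (:.)(..), All(..))
import Data.Array.Accelerate.Interpreter (run)

-- C = AB, where A is n x r, B is r x m and C is n x m.
mmult :: A.Num e => Acc (Matrix e) -> Acc (Matrix e) -> Acc (Matrix e)
mmult arr brr = A.fold (+) 0 (A.zipWith (*) arrRepl brrRepl)
  where
    -- The type annotations tell GHC which tuple-like shape to unpack into.
    Z :. rowsA :. _     = A.unlift (A.shape arr) :: Z :. Exp Int :. Exp Int
    Z :. _     :. colsB = A.unlift (A.shape brr) :: Z :. Exp Int :. Exp Int

    -- Both operands are expanded to a logical n x m x r array,
    -- then multiplied elementwise and summed over the innermost dimension.
    arrRepl = A.replicate (A.lift (Z :. All   :. colsB :. All)) arr
    brrRepl = A.replicate (A.lift (Z :. rowsA :. All   :. All)) (A.transpose brr)

-- Illustrative driver only: a 2x3 matrix times a 3x2 matrix.
main :: IO ()
main = do
  let a = A.fromList (Z :. 2 :. 3) [1 ..] :: Matrix Double
      b = A.fromList (Z :. 3 :. 2) [1 ..] :: Matrix Double
  print (run (mmult (A.use a) (A.use b)))
```

The two pattern bindings are where a missing annotation usually produces the confusing type errors: `A.unlift` is polymorphic in its result, so without `:: Z :. Exp Int :. Exp Int` the type checker has nothing to pin the shape down to. Swapping the `run` import for a GPU backend's `run` should leave `mmult` itself unchanged.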