3 Dec 2012 07:07

## Naive matrix multiplication with Accelerate

Hello cafe,

I've recently started learning about cuda and hetrogenous programming, and have been using accelerate [1] to help me out. Right now, I'm running into trouble in that I can't call parallel code from sequential code. Turns out GPUs aren't exactly like Repa =P.

Here's what I have so far:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate ( (:.)(..)
, Acc
, Vector
, Scalar
, Elt
, fold
, slice
, constant
, Array
, Z(..), DIM1, DIM2
, fromList
, All(..)
, generate
, lift, unlift
, shape
)
import Data.Array.Accelerate.Interpreter ( run )

dotP :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotP xs ys = fold (+) 0 \$ A.zipWith (*) xs ys

type Matrix a = Array DIM2 a

getRow :: Elt a => Int -> Acc (Matrix a) -> Acc (Vector a)
getRow n mat = slice mat . constant \$ Z :. n :. All

-- Naive matrix multiplication:
--
-- index (i, j) is equal to the ith row of 'a' `dot` the jth row of 'b'
matMul :: A.Acc (Matrix Double) -> A.Acc (Matrix Double) -> A.Acc (Matrix Double)
matMul a b' = A.generate (constant \$ Z :. nrows :. ncols) \$
\ix ->
let (Z :. i :. j) = unlift ix
in getRow i a `dotP` getRow j b
where
b = A.transpose b' -- I assume row indexing is faster than column indexing...
(Z :. nrows :.   _  ) = unlift \$ shape a
(Z :.   _   :. ncols) = unlift \$ shape b

This, of course, gives me errors right now because I'm calling getRow and dotP from within the generation function, which expects Exp[ression]s, not Acc[elerated computation]s.

So maybe I need to replace that line with an inner for loop? Is there an easy way to do that with Accelerate?

- Clark

```_______________________________________________
```
3 Dec 2012 08:06

### Re: Naive matrix multiplication with Accelerate

Hi Clark,

The trick is that most accelerate operations work over multidimensional arrays, so you can still get around the fact that we are limited to flat data-parallelism only.

Here is matrix multiplication in Accelerate, lifted from the first Repa paper [1].

import Data.Array.Accelerate as A

type Matrix a = Array DIM2 a

matMul :: (IsNum e, Elt e) => Acc (Matrix e) -> Acc (Matrix e) -> Acc (Matrix e)
matMul arr brr
= A.fold (+) 0
\$ A.zipWith (*) arrRepl brrRepl
where
Z :. rowsA :. _     = unlift (shape arr)    :: Z :. Exp Int :. Exp Int
Z :. _     :. colsB = unlift (shape brr)    :: Z :. Exp Int :. Exp Int

arrRepl             = A.replicate (lift \$ Z :. All   :. colsB :. All) arr
brrRepl             = A.replicate (lift \$ Z :. rowsA :. All   :. All) (A.transpose brr)

If you use github sources rather than the hackage package, those intermediate replicates will get fused away.

Cheers,
-Trev

On 03/12/2012, at 5:07 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Hello cafe,

I've recently started learning about cuda and hetrogenous programming, and have been using accelerate [1] to help me out. Right now, I'm running into trouble in that I can't call parallel code from sequential code. Turns out GPUs aren't exactly like Repa =P.

Here's what I have so far:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate ( (:.)(..)
, Acc
, Vector
, Scalar
, Elt
, fold
, slice
, constant
, Array
, Z(..), DIM1, DIM2
, fromList
, All(..)
, generate
, lift, unlift
, shape
)
import Data.Array.Accelerate.Interpreter ( run )

dotP :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotP xs ys = fold (+) 0 \$ A.zipWith (*) xs ys

type Matrix a = Array DIM2 a

getRow :: Elt a => Int -> Acc (Matrix a) -> Acc (Vector a)
getRow n mat = slice mat . constant \$ Z :. n :. All

-- Naive matrix multiplication:
--
-- index (i, j) is equal to the ith row of 'a' `dot` the jth row of 'b'
matMul :: A.Acc (Matrix Double) -> A.Acc (Matrix Double) -> A.Acc (Matrix Double)
matMul a b' = A.generate (constant \$ Z :. nrows :. ncols) \$
\ix ->
let (Z :. i :. j) = unlift ix
in getRow i a `dotP` getRow j b
where
b = A.transpose b' -- I assume row indexing is faster than column indexing...
(Z :. nrows :.   _  ) = unlift \$ shape a
(Z :.   _   :. ncols) = unlift \$ shape b

This, of course, gives me errors right now because I'm calling getRow and dotP from within the generation function, which expects Exp[ression]s, not Acc[elerated computation]s.

So maybe I need to replace that line with an inner for loop? Is there an easy way to do that with Accelerate?

- Clark

_______________________________________________

```_______________________________________________
```
3 Dec 2012 17:41

### Re: Naive matrix multiplication with Accelerate

Ah. I see now. Silly Haskell making inefficient algorithms hard to write and efficient ones easy. It's actually kind of annoying when learning, but probably for the best.

Is there a good write-up of the algorithm you're using somewhere? The Repa paper was very brief in its explaination, and I'm having trouble visualizing the mapping of the 2D matricies into 3 dimensions.

- Clark

On Mon, Dec 3, 2012 at 2:06 AM, Trevor L. McDonell wrote:
Hi Clark,

The trick is that most accelerate operations work over multidimensional arrays, so you can still get around the fact that we are limited to flat data-parallelism only.

Here is matrix multiplication in Accelerate, lifted from the first Repa paper [1].

import Data.Array.Accelerate as A

type Matrix a = Array DIM2 a

matMul :: (IsNum e, Elt e) => Acc (Matrix e) -> Acc (Matrix e) -> Acc (Matrix e)
matMul arr brr
= A.fold (+) 0
\$ A.zipWith (*) arrRepl brrRepl
where
Z :. rowsA :. _     = unlift (shape arr)    :: Z :. Exp Int :. Exp Int
Z :. _     :. colsB = unlift (shape brr)    :: Z :. Exp Int :. Exp Int

arrRepl             = A.replicate (lift \$ Z :. All   :. colsB :. All) arr
brrRepl             = A.replicate (lift \$ Z :. rowsA :. All   :. All) (A.transpose brr)

If you use github sources rather than the hackage package, those intermediate replicates will get fused away.

Cheers,
-Trev

On 03/12/2012, at 5:07 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Hello cafe,

I've recently started learning about cuda and hetrogenous programming, and have been using accelerate [1] to help me out. Right now, I'm running into trouble in that I can't call parallel code from sequential code. Turns out GPUs aren't exactly like Repa =P.

Here's what I have so far:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate ( (:.)(..)
, Acc
, Vector
, Scalar
, Elt
, fold
, slice
, constant
, Array
, Z(..), DIM1, DIM2
, fromList
, All(..)
, generate
, lift, unlift
, shape
)
import Data.Array.Accelerate.Interpreter ( run )

dotP :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotP xs ys = fold (+) 0 \$ A.zipWith (*) xs ys

type Matrix a = Array DIM2 a

getRow :: Elt a => Int -> Acc (Matrix a) -> Acc (Vector a)
getRow n mat = slice mat . constant \$ Z :. n :. All

-- Naive matrix multiplication:
--
-- index (i, j) is equal to the ith row of 'a' `dot` the jth row of 'b'
matMul :: A.Acc (Matrix Double) -> A.Acc (Matrix Double) -> A.Acc (Matrix Double)
matMul a b' = A.generate (constant \$ Z :. nrows :. ncols) \$
\ix ->
let (Z :. i :. j) = unlift ix
in getRow i a `dotP` getRow j b
where
b = A.transpose b' -- I assume row indexing is faster than column indexing...
(Z :. nrows :.   _  ) = unlift \$ shape a
(Z :.   _   :. ncols) = unlift \$ shape b

This, of course, gives me errors right now because I'm calling getRow and dotP from within the generation function, which expects Exp[ression]s, not Acc[elerated computation]s.

So maybe I need to replace that line with an inner for loop? Is there an easy way to do that with Accelerate?

- Clark

_______________________________________________

_______________________________________________

```_______________________________________________
```
4 Dec 2012 06:21

### Re: Naive matrix multiplication with Accelerate

As far as I am aware, the only description is in the Repa paper. I you are right, it really should be explained properly somewhere…

At a simpler example, here is the outer product of two vectors [1].

vvProd :: (IsNum e, Elt e) => Acc (Vector e) -> Acc (Vector e) -> Acc (Matrix e)
vvProd xs ys = A.zipWith (*) xsRepl ysRepl
where
n           = A.size xs
m           = A.size ys

xsRepl      = A.replicate (lift (Z :. All :. m  )) xs
ysRepl      = A.replicate (lift (Z :. n   :. All)) ys

If we then `A.fold (+) 0` the matrix, it would reduce along each row producing a vector. So the first element of that vector is going to be calculated as (xs[0] * ys[0] + xs[0] * ys[1] +  … xs[0] * ys[m-1]). That's the idea we want for our matrix multiplication … but I agree, it is difficult for me to visualise as well.

I do the same sort of trick with the n-body demo to get all n^2 particle interactions.

-Trev

On 04/12/2012, at 3:41 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Ah. I see now. Silly Haskell making inefficient algorithms hard to write and efficient ones easy. It's actually kind of annoying when learning, but probably for the best.

Is there a good write-up of the algorithm you're using somewhere? The Repa paper was very brief in its explaination, and I'm having trouble visualizing the mapping of the 2D matricies into 3 dimensions.

- Clark

On Mon, Dec 3, 2012 at 2:06 AM, Trevor L. McDonell wrote:
Hi Clark,

The trick is that most accelerate operations work over multidimensional arrays, so you can still get around the fact that we are limited to flat data-parallelism only.

Here is matrix multiplication in Accelerate, lifted from the first Repa paper [1].

import Data.Array.Accelerate as A

type Matrix a = Array DIM2 a

matMul :: (IsNum e, Elt e) => Acc (Matrix e) -> Acc (Matrix e) -> Acc (Matrix e)
matMul arr brr
= A.fold (+) 0
\$ A.zipWith (*) arrRepl brrRepl
where
Z :. rowsA :. _     = unlift (shape arr)    :: Z :. Exp Int :. Exp Int
Z :. _     :. colsB = unlift (shape brr)    :: Z :. Exp Int :. Exp Int

arrRepl             = A.replicate (lift \$ Z :. All   :. colsB :. All) arr
brrRepl             = A.replicate (lift \$ Z :. rowsA :. All   :. All) (A.transpose brr)

If you use github sources rather than the hackage package, those intermediate replicates will get fused away.

Cheers,
-Trev

On 03/12/2012, at 5:07 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Hello cafe,

I've recently started learning about cuda and hetrogenous programming, and have been using accelerate [1] to help me out. Right now, I'm running into trouble in that I can't call parallel code from sequential code. Turns out GPUs aren't exactly like Repa =P.

Here's what I have so far:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate ( (:.)(..)
, Acc
, Vector
, Scalar
, Elt
, fold
, slice
, constant
, Array
, Z(..), DIM1, DIM2
, fromList
, All(..)
, generate
, lift, unlift
, shape
)
import Data.Array.Accelerate.Interpreter ( run )

dotP :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotP xs ys = fold (+) 0 \$ A.zipWith (*) xs ys

type Matrix a = Array DIM2 a

getRow :: Elt a => Int -> Acc (Matrix a) -> Acc (Vector a)
getRow n mat = slice mat . constant \$ Z :. n :. All

-- Naive matrix multiplication:
--
-- index (i, j) is equal to the ith row of 'a' `dot` the jth row of 'b'
matMul :: A.Acc (Matrix Double) -> A.Acc (Matrix Double) -> A.Acc (Matrix Double)
matMul a b' = A.generate (constant \$ Z :. nrows :. ncols) \$
\ix ->
let (Z :. i :. j) = unlift ix
in getRow i a `dotP` getRow j b
where
b = A.transpose b' -- I assume row indexing is faster than column indexing...
(Z :. nrows :.   _  ) = unlift \$ shape a
(Z :.   _   :. ncols) = unlift \$ shape b

This, of course, gives me errors right now because I'm calling getRow and dotP from within the generation function, which expects Exp[ression]s, not Acc[elerated computation]s.

So maybe I need to replace that line with an inner for loop? Is there an easy way to do that with Accelerate?

- Clark

_______________________________________________

_______________________________________________

```_______________________________________________
```
5 Dec 2012 01:08

### Re: Naive matrix multiplication with Accelerate

I don't mean to be blunt, but have you guys taken a course in linear algebra?

On Mon, Dec 3, 2012 at 9:21 PM, Trevor L. McDonell wrote:
As far as I am aware, the only description is in the Repa paper. I you are right, it really should be explained properly somewhere…

At a simpler example, here is the outer product of two vectors [1].

vvProd :: (IsNum e, Elt e) => Acc (Vector e) -> Acc (Vector e) -> Acc (Matrix e)
vvProd xs ys = A.zipWith (*) xsRepl ysRepl
where
n           = A.size xs
m           = A.size ys

xsRepl      = A.replicate (lift (Z :. All :. m  )) xs
ysRepl      = A.replicate (lift (Z :. n   :. All)) ys

If we then `A.fold (+) 0` the matrix, it would reduce along each row producing a vector. So the first element of that vector is going to be calculated as (xs[0] * ys[0] + xs[0] * ys[1] +  … xs[0] * ys[m-1]). That's the idea we want for our matrix multiplication … but I agree, it is difficult for me to visualise as well.

I do the same sort of trick with the n-body demo to get all n^2 particle interactions.

-Trev

On 04/12/2012, at 3:41 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Ah. I see now. Silly Haskell making inefficient algorithms hard to write and efficient ones easy. It's actually kind of annoying when learning, but probably for the best.

Is there a good write-up of the algorithm you're using somewhere? The Repa paper was very brief in its explaination, and I'm having trouble visualizing the mapping of the 2D matricies into 3 dimensions.

- Clark

On Mon, Dec 3, 2012 at 2:06 AM, Trevor L. McDonell wrote:
Hi Clark,

The trick is that most accelerate operations work over multidimensional arrays, so you can still get around the fact that we are limited to flat data-parallelism only.

Here is matrix multiplication in Accelerate, lifted from the first Repa paper [1].

import Data.Array.Accelerate as A

type Matrix a = Array DIM2 a

matMul :: (IsNum e, Elt e) => Acc (Matrix e) -> Acc (Matrix e) -> Acc (Matrix e)
matMul arr brr
= A.fold (+) 0
\$ A.zipWith (*) arrRepl brrRepl
where
Z :. rowsA :. _     = unlift (shape arr)    :: Z :. Exp Int :. Exp Int
Z :. _     :. colsB = unlift (shape brr)    :: Z :. Exp Int :. Exp Int

arrRepl             = A.replicate (lift \$ Z :. All   :. colsB :. All) arr
brrRepl             = A.replicate (lift \$ Z :. rowsA :. All   :. All) (A.transpose brr)

If you use github sources rather than the hackage package, those intermediate replicates will get fused away.

Cheers,
-Trev

On 03/12/2012, at 5:07 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Hello cafe,

I've recently started learning about cuda and hetrogenous programming, and have been using accelerate [1] to help me out. Right now, I'm running into trouble in that I can't call parallel code from sequential code. Turns out GPUs aren't exactly like Repa =P.

Here's what I have so far:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate ( (:.)(..)
, Acc
, Vector
, Scalar
, Elt
, fold
, slice
, constant
, Array
, Z(..), DIM1, DIM2
, fromList
, All(..)
, generate
, lift, unlift
, shape
)
import Data.Array.Accelerate.Interpreter ( run )

dotP :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotP xs ys = fold (+) 0 \$ A.zipWith (*) xs ys

type Matrix a = Array DIM2 a

getRow :: Elt a => Int -> Acc (Matrix a) -> Acc (Vector a)
getRow n mat = slice mat . constant \$ Z :. n :. All

-- Naive matrix multiplication:
--
-- index (i, j) is equal to the ith row of 'a' `dot` the jth row of 'b'
matMul :: A.Acc (Matrix Double) -> A.Acc (Matrix Double) -> A.Acc (Matrix Double)
matMul a b' = A.generate (constant \$ Z :. nrows :. ncols) \$
\ix ->
let (Z :. i :. j) = unlift ix
in getRow i a `dotP` getRow j b
where
b = A.transpose b' -- I assume row indexing is faster than column indexing...
(Z :. nrows :.   _  ) = unlift \$ shape a
(Z :.   _   :. ncols) = unlift \$ shape b

This, of course, gives me errors right now because I'm calling getRow and dotP from within the generation function, which expects Exp[ression]s, not Acc[elerated computation]s.

So maybe I need to replace that line with an inner for loop? Is there an easy way to do that with Accelerate?

- Clark

_______________________________________________

_______________________________________________

_______________________________________________

```_______________________________________________
```
5 Dec 2012 01:13

### Re: Naive matrix multiplication with Accelerate

No. But that doesn't stop me from being curious with Accelerate. Might you have a better explaination for what's happening here than Trevor's?

- Clark

On Tue, Dec 4, 2012 at 7:08 PM, Alexander Solla wrote:
I don't mean to be blunt, but have you guys taken a course in linear algebra?

On Mon, Dec 3, 2012 at 9:21 PM, Trevor L. McDonell wrote:
As far as I am aware, the only description is in the Repa paper. I you are right, it really should be explained properly somewhere…

At a simpler example, here is the outer product of two vectors [1].

vvProd :: (IsNum e, Elt e) => Acc (Vector e) -> Acc (Vector e) -> Acc (Matrix e)
vvProd xs ys = A.zipWith (*) xsRepl ysRepl
where
n           = A.size xs
m           = A.size ys

xsRepl      = A.replicate (lift (Z :. All :. m  )) xs
ysRepl      = A.replicate (lift (Z :. n   :. All)) ys

If we then `A.fold (+) 0` the matrix, it would reduce along each row producing a vector. So the first element of that vector is going to be calculated as (xs[0] * ys[0] + xs[0] * ys[1] +  … xs[0] * ys[m-1]). That's the idea we want for our matrix multiplication … but I agree, it is difficult for me to visualise as well.

I do the same sort of trick with the n-body demo to get all n^2 particle interactions.

-Trev

On 04/12/2012, at 3:41 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Ah. I see now. Silly Haskell making inefficient algorithms hard to write and efficient ones easy. It's actually kind of annoying when learning, but probably for the best.

Is there a good write-up of the algorithm you're using somewhere? The Repa paper was very brief in its explaination, and I'm having trouble visualizing the mapping of the 2D matricies into 3 dimensions.

- Clark

On Mon, Dec 3, 2012 at 2:06 AM, Trevor L. McDonell wrote:
Hi Clark,

The trick is that most accelerate operations work over multidimensional arrays, so you can still get around the fact that we are limited to flat data-parallelism only.

Here is matrix multiplication in Accelerate, lifted from the first Repa paper [1].

import Data.Array.Accelerate as A

type Matrix a = Array DIM2 a

matMul :: (IsNum e, Elt e) => Acc (Matrix e) -> Acc (Matrix e) -> Acc (Matrix e)
matMul arr brr
= A.fold (+) 0
\$ A.zipWith (*) arrRepl brrRepl
where
Z :. rowsA :. _     = unlift (shape arr)    :: Z :. Exp Int :. Exp Int
Z :. _     :. colsB = unlift (shape brr)    :: Z :. Exp Int :. Exp Int

arrRepl             = A.replicate (lift \$ Z :. All   :. colsB :. All) arr
brrRepl             = A.replicate (lift \$ Z :. rowsA :. All   :. All) (A.transpose brr)

If you use github sources rather than the hackage package, those intermediate replicates will get fused away.

Cheers,
-Trev

On 03/12/2012, at 5:07 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Hello cafe,

I've recently started learning about cuda and hetrogenous programming, and have been using accelerate [1] to help me out. Right now, I'm running into trouble in that I can't call parallel code from sequential code. Turns out GPUs aren't exactly like Repa =P.

Here's what I have so far:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate ( (:.)(..)
, Acc
, Vector
, Scalar
, Elt
, fold
, slice
, constant
, Array
, Z(..), DIM1, DIM2
, fromList
, All(..)
, generate
, lift, unlift
, shape
)
import Data.Array.Accelerate.Interpreter ( run )

dotP :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotP xs ys = fold (+) 0 \$ A.zipWith (*) xs ys

type Matrix a = Array DIM2 a

getRow :: Elt a => Int -> Acc (Matrix a) -> Acc (Vector a)
getRow n mat = slice mat . constant \$ Z :. n :. All

-- Naive matrix multiplication:
--
-- index (i, j) is equal to the ith row of 'a' `dot` the jth row of 'b'
matMul :: A.Acc (Matrix Double) -> A.Acc (Matrix Double) -> A.Acc (Matrix Double)
matMul a b' = A.generate (constant \$ Z :. nrows :. ncols) \$
\ix ->
let (Z :. i :. j) = unlift ix
in getRow i a `dotP` getRow j b
where
b = A.transpose b' -- I assume row indexing is faster than column indexing...
(Z :. nrows :.   _  ) = unlift \$ shape a
(Z :.   _   :. ncols) = unlift \$ shape b

This, of course, gives me errors right now because I'm calling getRow and dotP from within the generation function, which expects Exp[ression]s, not Acc[elerated computation]s.

So maybe I need to replace that line with an inner for loop? Is there an easy way to do that with Accelerate?

- Clark

_______________________________________________

_______________________________________________

_______________________________________________

```_______________________________________________
```
5 Dec 2012 01:37

### Re: Naive matrix multiplication with Accelerate

Well, an m x n matrix corresponds to a linear transformation in at most min{m,n} dimensions.  In particular, this means that a 2x2 matrix corresponds to a plane, line, or the origin of 3-space, as a linear subspace.  Which of those the matrix corresponds to depends on the matrix's "rank", which is the number of linearly independent columns (or rows) in the matrix.

Do you really need to know /which/ plane or line a matrix corresponds to?  If so, reduce it using Gaussian elimination and, if appropriate, compute its eigenvectors or span.  Otherwise, think of it as a generic plane/line/0-point.

Outer products represent more of these simple facts about linear algebra.  The product of an mx1 and 1xn matrices is an mxn matrix with rank at most 1.  Trouble visualizing this means you are missing the essential facts (for the general picture as the product as a line or origin), or requires some computational details -- reducing the matrix using Gaussian elimination and determining its span.

As I said, I don't mean to be harsh, but playing with a vector algebra package without understanding vectors is like playing with a calculator without understanding multiplication.  You're better off learning what multiplication represents first, before using a machine to do it fast.  So, I can humbly recommend taking a course on the subject.  For example, https://www.coursera.org/course/matrix

On Tue, Dec 4, 2012 at 4:13 PM, Clark Gaebel wrote:
No. But that doesn't stop me from being curious with Accelerate. Might you have a better explaination for what's happening here than Trevor's?

- Clark

On Tue, Dec 4, 2012 at 7:08 PM, Alexander Solla wrote:
I don't mean to be blunt, but have you guys taken a course in linear algebra?

On Mon, Dec 3, 2012 at 9:21 PM, Trevor L. McDonell wrote:
As far as I am aware, the only description is in the Repa paper. I you are right, it really should be explained properly somewhere…

At a simpler example, here is the outer product of two vectors [1].

vvProd :: (IsNum e, Elt e) => Acc (Vector e) -> Acc (Vector e) -> Acc (Matrix e)
vvProd xs ys = A.zipWith (*) xsRepl ysRepl
where
n           = A.size xs
m           = A.size ys

xsRepl      = A.replicate (lift (Z :. All :. m  )) xs
ysRepl      = A.replicate (lift (Z :. n   :. All)) ys

If we then `A.fold (+) 0` the matrix, it would reduce along each row producing a vector. So the first element of that vector is going to be calculated as (xs[0] * ys[0] + xs[0] * ys[1] +  … xs[0] * ys[m-1]). That's the idea we want for our matrix multiplication … but I agree, it is difficult for me to visualise as well.

I do the same sort of trick with the n-body demo to get all n^2 particle interactions.

-Trev

On 04/12/2012, at 3:41 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Ah. I see now. Silly Haskell making inefficient algorithms hard to write and efficient ones easy. It's actually kind of annoying when learning, but probably for the best.

Is there a good write-up of the algorithm you're using somewhere? The Repa paper was very brief in its explaination, and I'm having trouble visualizing the mapping of the 2D matricies into 3 dimensions.

- Clark

On Mon, Dec 3, 2012 at 2:06 AM, Trevor L. McDonell wrote:
Hi Clark,

The trick is that most accelerate operations work over multidimensional arrays, so you can still get around the fact that we are limited to flat data-parallelism only.

Here is matrix multiplication in Accelerate, lifted from the first Repa paper [1].

import Data.Array.Accelerate as A

type Matrix a = Array DIM2 a

matMul :: (IsNum e, Elt e) => Acc (Matrix e) -> Acc (Matrix e) -> Acc (Matrix e)
matMul arr brr
= A.fold (+) 0
\$ A.zipWith (*) arrRepl brrRepl
where
Z :. rowsA :. _     = unlift (shape arr)    :: Z :. Exp Int :. Exp Int
Z :. _     :. colsB = unlift (shape brr)    :: Z :. Exp Int :. Exp Int

arrRepl             = A.replicate (lift \$ Z :. All   :. colsB :. All) arr
brrRepl             = A.replicate (lift \$ Z :. rowsA :. All   :. All) (A.transpose brr)

If you use github sources rather than the hackage package, those intermediate replicates will get fused away.

Cheers,
-Trev

On 03/12/2012, at 5:07 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Hello cafe,

I've recently started learning about cuda and hetrogenous programming, and have been using accelerate [1] to help me out. Right now, I'm running into trouble in that I can't call parallel code from sequential code. Turns out GPUs aren't exactly like Repa =P.

Here's what I have so far:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate ( (:.)(..)
, Acc
, Vector
, Scalar
, Elt
, fold
, slice
, constant
, Array
, Z(..), DIM1, DIM2
, fromList
, All(..)
, generate
, lift, unlift
, shape
)
import Data.Array.Accelerate.Interpreter ( run )

dotP :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotP xs ys = fold (+) 0 \$ A.zipWith (*) xs ys

type Matrix a = Array DIM2 a

getRow :: Elt a => Int -> Acc (Matrix a) -> Acc (Vector a)
getRow n mat = slice mat . constant \$ Z :. n :. All

-- Naive matrix multiplication:
--
-- index (i, j) is equal to the ith row of 'a' `dot` the jth row of 'b'
matMul :: A.Acc (Matrix Double) -> A.Acc (Matrix Double) -> A.Acc (Matrix Double)
matMul a b' = A.generate (constant \$ Z :. nrows :. ncols) \$
\ix ->
let (Z :. i :. j) = unlift ix
in getRow i a `dotP` getRow j b
where
b = A.transpose b' -- I assume row indexing is faster than column indexing...
(Z :. nrows :.   _  ) = unlift \$ shape a
(Z :.   _   :. ncols) = unlift \$ shape b

This, of course, gives me errors right now because I'm calling getRow and dotP from within the generation function, which expects Exp[ression]s, not Acc[elerated computation]s.

So maybe I need to replace that line with an inner for loop? Is there an easy way to do that with Accelerate?

- Clark

_______________________________________________

_______________________________________________

_______________________________________________

```_______________________________________________
```
5 Dec 2012 01:43

### Re: Naive matrix multiplication with Accelerate

Sorry, I didn't realize that course was offered next year.  I read through "Matrices and Linear Algebra" when I was in high school.  And used Friedberg, Insel, Spence's "Linear Algebra" in college.

On Tue, Dec 4, 2012 at 4:37 PM, Alexander Solla wrote:
Well, an m x n matrix corresponds to a linear transformation in at most min{m,n} dimensions.  In particular, this means that a 2x2 matrix corresponds to a plane, line, or the origin of 3-space, as a linear subspace.  Which of those the matrix corresponds to depends on the matrix's "rank", which is the number of linearly independent columns (or rows) in the matrix.

Do you really need to know /which/ plane or line a matrix corresponds to?  If so, reduce it using Gaussian elimination and, if appropriate, compute its eigenvectors or span.  Otherwise, think of it as a generic plane/line/0-point.

Outer products represent more of these simple facts about linear algebra.  The product of an mx1 and 1xn matrices is an mxn matrix with rank at most 1.  Trouble visualizing this means you are missing the essential facts (for the general picture as the product as a line or origin), or requires some computational details -- reducing the matrix using Gaussian elimination and determining its span.

As I said, I don't mean to be harsh, but playing with a vector algebra package without understanding vectors is like playing with a calculator without understanding multiplication.  You're better off learning what multiplication represents first, before using a machine to do it fast.  So, I can humbly recommend taking a course on the subject.  For example, https://www.coursera.org/course/matrix

On Tue, Dec 4, 2012 at 4:13 PM, Clark Gaebel wrote:
No. But that doesn't stop me from being curious with Accelerate. Might you have a better explaination for what's happening here than Trevor's?

- Clark

On Tue, Dec 4, 2012 at 7:08 PM, Alexander Solla wrote:
I don't mean to be blunt, but have you guys taken a course in linear algebra?

On Mon, Dec 3, 2012 at 9:21 PM, Trevor L. McDonell wrote:
As far as I am aware, the only description is in the Repa paper. I you are right, it really should be explained properly somewhere…

At a simpler example, here is the outer product of two vectors [1].

vvProd :: (IsNum e, Elt e) => Acc (Vector e) -> Acc (Vector e) -> Acc (Matrix e)
vvProd xs ys = A.zipWith (*) xsRepl ysRepl
where
n           = A.size xs
m           = A.size ys

xsRepl      = A.replicate (lift (Z :. All :. m  )) xs
ysRepl      = A.replicate (lift (Z :. n   :. All)) ys

If we then `A.fold (+) 0` the matrix, it would reduce along each row producing a vector. So the first element of that vector is going to be calculated as (xs[0] * ys[0] + xs[0] * ys[1] +  … xs[0] * ys[m-1]). That's the idea we want for our matrix multiplication … but I agree, it is difficult for me to visualise as well.

I do the same sort of trick with the n-body demo to get all n^2 particle interactions.

-Trev

On 04/12/2012, at 3:41 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Ah. I see now. Silly Haskell making inefficient algorithms hard to write and efficient ones easy. It's actually kind of annoying when learning, but probably for the best.

Is there a good write-up of the algorithm you're using somewhere? The Repa paper was very brief in its explaination, and I'm having trouble visualizing the mapping of the 2D matricies into 3 dimensions.

- Clark

On Mon, Dec 3, 2012 at 2:06 AM, Trevor L. McDonell wrote:
Hi Clark,

The trick is that most accelerate operations work over multidimensional arrays, so you can still get around the fact that we are limited to flat data-parallelism only.

Here is matrix multiplication in Accelerate, lifted from the first Repa paper [1].

import Data.Array.Accelerate as A

type Matrix a = Array DIM2 a

matMul :: (IsNum e, Elt e) => Acc (Matrix e) -> Acc (Matrix e) -> Acc (Matrix e)
matMul arr brr
= A.fold (+) 0
\$ A.zipWith (*) arrRepl brrRepl
where
Z :. rowsA :. _     = unlift (shape arr)    :: Z :. Exp Int :. Exp Int
Z :. _     :. colsB = unlift (shape brr)    :: Z :. Exp Int :. Exp Int

arrRepl             = A.replicate (lift \$ Z :. All   :. colsB :. All) arr
brrRepl             = A.replicate (lift \$ Z :. rowsA :. All   :. All) (A.transpose brr)

If you use github sources rather than the hackage package, those intermediate replicates will get fused away.

Cheers,
-Trev

On 03/12/2012, at 5:07 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Hello cafe,

I've recently started learning about cuda and hetrogenous programming, and have been using accelerate [1] to help me out. Right now, I'm running into trouble in that I can't call parallel code from sequential code. Turns out GPUs aren't exactly like Repa =P.

Here's what I have so far:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate ( (:.)(..)
, Acc
, Vector
, Scalar
, Elt
, fold
, slice
, constant
, Array
, Z(..), DIM1, DIM2
, fromList
, All(..)
, generate
, lift, unlift
, shape
)
import Data.Array.Accelerate.Interpreter ( run )

dotP :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotP xs ys = fold (+) 0 \$ A.zipWith (*) xs ys

type Matrix a = Array DIM2 a

getRow :: Elt a => Int -> Acc (Matrix a) -> Acc (Vector a)
getRow n mat = slice mat . constant \$ Z :. n :. All

-- Naive matrix multiplication:
--
-- index (i, j) is equal to the ith row of 'a' `dot` the jth row of 'b'
matMul :: A.Acc (Matrix Double) -> A.Acc (Matrix Double) -> A.Acc (Matrix Double)
matMul a b' = A.generate (constant \$ Z :. nrows :. ncols) \$
\ix ->
let (Z :. i :. j) = unlift ix
in getRow i a `dotP` getRow j b
where
b = A.transpose b' -- I assume row indexing is faster than column indexing...
(Z :. nrows :.   _  ) = unlift \$ shape a
(Z :.   _   :. ncols) = unlift \$ shape b

This, of course, gives me errors right now because I'm calling getRow and dotP from within the generation function, which expects Exp[ression]s, not Acc[elerated computation]s.

So maybe I need to replace that line with an inner for loop? Is there an easy way to do that with Accelerate?

- Clark

_______________________________________________

_______________________________________________

_______________________________________________

```_______________________________________________
```
5 Dec 2012 02:13

### Re: Naive matrix multiplication with Accelerate

Thanks for the insight Alex.
In case you are concerned, Accelerate is not in fact a linear algebra library.

On 05/12/2012, at 11:43 AM, Alexander Solla <alex.solla <at> gmail.com> wrote:

Sorry, I didn't realize that course was offered next year.  I read through "Matrices and Linear Algebra" when I was in high school.  And used Friedberg, Insel, Spence's "Linear Algebra" in college.

On Tue, Dec 4, 2012 at 4:37 PM, Alexander Solla wrote:
Well, an m x n matrix corresponds to a linear transformation in at most min{m,n} dimensions.  In particular, this means that a 2x2 matrix corresponds to a plane, line, or the origin of 3-space, as a linear subspace.  Which of those the matrix corresponds to depends on the matrix's "rank", which is the number of linearly independent columns (or rows) in the matrix.

Do you really need to know /which/ plane or line a matrix corresponds to?  If so, reduce it using Gaussian elimination and, if appropriate, compute its eigenvectors or span.  Otherwise, think of it as a generic plane/line/0-point.

Outer products represent more of these simple facts about linear algebra.  The product of an mx1 and 1xn matrices is an mxn matrix with rank at most 1.  Trouble visualizing this means you are missing the essential facts (for the general picture as the product as a line or origin), or requires some computational details -- reducing the matrix using Gaussian elimination and determining its span.

As I said, I don't mean to be harsh, but playing with a vector algebra package without understanding vectors is like playing with a calculator without understanding multiplication.  You're better off learning what multiplication represents first, before using a machine to do it fast.  So, I can humbly recommend taking a course on the subject.  For example, https://www.coursera.org/course/matrix

On Tue, Dec 4, 2012 at 4:13 PM, Clark Gaebel wrote:
No. But that doesn't stop me from being curious with Accelerate. Might you have a better explaination for what's happening here than Trevor's?

- Clark

On Tue, Dec 4, 2012 at 7:08 PM, Alexander Solla wrote:
I don't mean to be blunt, but have you guys taken a course in linear algebra?

On Mon, Dec 3, 2012 at 9:21 PM, Trevor L. McDonell wrote:
As far as I am aware, the only description is in the Repa paper. I you are right, it really should be explained properly somewhere…

At a simpler example, here is the outer product of two vectors [1].

vvProd :: (IsNum e, Elt e) => Acc (Vector e) -> Acc (Vector e) -> Acc (Matrix e)
vvProd xs ys = A.zipWith (*) xsRepl ysRepl
where
n           = A.size xs
m           = A.size ys

xsRepl      = A.replicate (lift (Z :. All :. m  )) xs
ysRepl      = A.replicate (lift (Z :. n   :. All)) ys

If we then `A.fold (+) 0` the matrix, it would reduce along each row producing a vector. So the first element of that vector is going to be calculated as (xs[0] * ys[0] + xs[0] * ys[1] +  … xs[0] * ys[m-1]). That's the idea we want for our matrix multiplication … but I agree, it is difficult for me to visualise as well.

I do the same sort of trick with the n-body demo to get all n^2 particle interactions.

-Trev

On 04/12/2012, at 3:41 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Ah. I see now. Silly Haskell making inefficient algorithms hard to write and efficient ones easy. It's actually kind of annoying when learning, but probably for the best.

Is there a good write-up of the algorithm you're using somewhere? The Repa paper was very brief in its explaination, and I'm having trouble visualizing the mapping of the 2D matricies into 3 dimensions.

- Clark

On Mon, Dec 3, 2012 at 2:06 AM, Trevor L. McDonell wrote:
Hi Clark,

The trick is that most accelerate operations work over multidimensional arrays, so you can still get around the fact that we are limited to flat data-parallelism only.

Here is matrix multiplication in Accelerate, lifted from the first Repa paper [1].

import Data.Array.Accelerate as A

type Matrix a = Array DIM2 a

matMul :: (IsNum e, Elt e) => Acc (Matrix e) -> Acc (Matrix e) -> Acc (Matrix e)
matMul arr brr
= A.fold (+) 0
\$ A.zipWith (*) arrRepl brrRepl
where
Z :. rowsA :. _     = unlift (shape arr)    :: Z :. Exp Int :. Exp Int
Z :. _     :. colsB = unlift (shape brr)    :: Z :. Exp Int :. Exp Int

arrRepl             = A.replicate (lift \$ Z :. All   :. colsB :. All) arr
brrRepl             = A.replicate (lift \$ Z :. rowsA :. All   :. All) (A.transpose brr)

If you use github sources rather than the hackage package, those intermediate replicates will get fused away.

Cheers,
-Trev

On 03/12/2012, at 5:07 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Hello cafe,

I've recently started learning about cuda and hetrogenous programming, and have been using accelerate [1] to help me out. Right now, I'm running into trouble in that I can't call parallel code from sequential code. Turns out GPUs aren't exactly like Repa =P.

Here's what I have so far:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate ( (:.)(..)
, Acc
, Vector
, Scalar
, Elt
, fold
, slice
, constant
, Array
, Z(..), DIM1, DIM2
, fromList
, All(..)
, generate
, lift, unlift
, shape
)
import Data.Array.Accelerate.Interpreter ( run )

dotP :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotP xs ys = fold (+) 0 \$ A.zipWith (*) xs ys

type Matrix a = Array DIM2 a

getRow :: Elt a => Int -> Acc (Matrix a) -> Acc (Vector a)
getRow n mat = slice mat . constant \$ Z :. n :. All

-- Naive matrix multiplication:
--
-- index (i, j) is equal to the ith row of 'a' `dot` the jth row of 'b'
matMul :: A.Acc (Matrix Double) -> A.Acc (Matrix Double) -> A.Acc (Matrix Double)
matMul a b' = A.generate (constant \$ Z :. nrows :. ncols) \$
\ix ->
let (Z :. i :. j) = unlift ix
in getRow i a `dotP` getRow j b
where
b = A.transpose b' -- I assume row indexing is faster than column indexing...
(Z :. nrows :.   _  ) = unlift \$ shape a
(Z :.   _   :. ncols) = unlift \$ shape b

This, of course, gives me errors right now because I'm calling getRow and dotP from within the generation function, which expects Exp[ression]s, not Acc[elerated computation]s.

So maybe I need to replace that line with an inner for loop? Is there an easy way to do that with Accelerate?

- Clark

_______________________________________________

_______________________________________________

_______________________________________________

```_______________________________________________
```
5 Dec 2012 02:20

### Re: Naive matrix multiplication with Accelerate

Thanks! I'll read through "Matricies and Linear Algebra" over the next few days.

- Clark

On Tue, Dec 4, 2012 at 7:43 PM, Alexander Solla wrote:
Sorry, I didn't realize that course was offered next year.  I read through "Matrices and Linear Algebra" when I was in high school.  And used Friedberg, Insel, Spence's "Linear Algebra" in college.

On Tue, Dec 4, 2012 at 4:37 PM, Alexander Solla wrote:
Well, an m x n matrix corresponds to a linear transformation in at most min{m,n} dimensions.  In particular, this means that a 2x2 matrix corresponds to a plane, line, or the origin of 3-space, as a linear subspace.  Which of those the matrix corresponds to depends on the matrix's "rank", which is the number of linearly independent columns (or rows) in the matrix.

Do you really need to know /which/ plane or line a matrix corresponds to?  If so, reduce it using Gaussian elimination and, if appropriate, compute its eigenvectors or span.  Otherwise, think of it as a generic plane/line/0-point.

Outer products represent more of these simple facts about linear algebra.  The product of an mx1 and 1xn matrices is an mxn matrix with rank at most 1.  Trouble visualizing this means you are missing the essential facts (for the general picture as the product as a line or origin), or requires some computational details -- reducing the matrix using Gaussian elimination and determining its span.

As I said, I don't mean to be harsh, but playing with a vector algebra package without understanding vectors is like playing with a calculator without understanding multiplication.  You're better off learning what multiplication represents first, before using a machine to do it fast.  So, I can humbly recommend taking a course on the subject.  For example, https://www.coursera.org/course/matrix

On Tue, Dec 4, 2012 at 4:13 PM, Clark Gaebel wrote:
No. But that doesn't stop me from being curious with Accelerate. Might you have a better explaination for what's happening here than Trevor's?

- Clark

On Tue, Dec 4, 2012 at 7:08 PM, Alexander Solla wrote:
I don't mean to be blunt, but have you guys taken a course in linear algebra?

On Mon, Dec 3, 2012 at 9:21 PM, Trevor L. McDonell wrote:
As far as I am aware, the only description is in the Repa paper. I you are right, it really should be explained properly somewhere…

At a simpler example, here is the outer product of two vectors [1].

vvProd :: (IsNum e, Elt e) => Acc (Vector e) -> Acc (Vector e) -> Acc (Matrix e)
vvProd xs ys = A.zipWith (*) xsRepl ysRepl
where
n           = A.size xs
m           = A.size ys

xsRepl      = A.replicate (lift (Z :. All :. m  )) xs
ysRepl      = A.replicate (lift (Z :. n   :. All)) ys

If we then `A.fold (+) 0` the matrix, it would reduce along each row producing a vector. So the first element of that vector is going to be calculated as (xs[0] * ys[0] + xs[0] * ys[1] +  … xs[0] * ys[m-1]). That's the idea we want for our matrix multiplication … but I agree, it is difficult for me to visualise as well.

I do the same sort of trick with the n-body demo to get all n^2 particle interactions.

-Trev

On 04/12/2012, at 3:41 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Ah. I see now. Silly Haskell making inefficient algorithms hard to write and efficient ones easy. It's actually kind of annoying when learning, but probably for the best.

Is there a good write-up of the algorithm you're using somewhere? The Repa paper was very brief in its explaination, and I'm having trouble visualizing the mapping of the 2D matricies into 3 dimensions.

- Clark

On Mon, Dec 3, 2012 at 2:06 AM, Trevor L. McDonell wrote:
Hi Clark,

The trick is that most accelerate operations work over multidimensional arrays, so you can still get around the fact that we are limited to flat data-parallelism only.

Here is matrix multiplication in Accelerate, lifted from the first Repa paper [1].

import Data.Array.Accelerate as A

type Matrix a = Array DIM2 a

matMul :: (IsNum e, Elt e) => Acc (Matrix e) -> Acc (Matrix e) -> Acc (Matrix e)
matMul arr brr
= A.fold (+) 0
\$ A.zipWith (*) arrRepl brrRepl
where
Z :. rowsA :. _     = unlift (shape arr)    :: Z :. Exp Int :. Exp Int
Z :. _     :. colsB = unlift (shape brr)    :: Z :. Exp Int :. Exp Int

arrRepl             = A.replicate (lift \$ Z :. All   :. colsB :. All) arr
brrRepl             = A.replicate (lift \$ Z :. rowsA :. All   :. All) (A.transpose brr)

If you use github sources rather than the hackage package, those intermediate replicates will get fused away.

Cheers,
-Trev

On 03/12/2012, at 5:07 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Hello cafe,

I've recently started learning about cuda and hetrogenous programming, and have been using accelerate [1] to help me out. Right now, I'm running into trouble in that I can't call parallel code from sequential code. Turns out GPUs aren't exactly like Repa =P.

Here's what I have so far:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate ( (:.)(..)
, Acc
, Vector
, Scalar
, Elt
, fold
, slice
, constant
, Array
, Z(..), DIM1, DIM2
, fromList
, All(..)
, generate
, lift, unlift
, shape
)
import Data.Array.Accelerate.Interpreter ( run )

dotP :: (Num a, Elt a) => Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotP xs ys = fold (+) 0 \$ A.zipWith (*) xs ys

type Matrix a = Array DIM2 a

getRow :: Elt a => Int -> Acc (Matrix a) -> Acc (Vector a)
getRow n mat = slice mat . constant \$ Z :. n :. All

-- Naive matrix multiplication:
--
-- index (i, j) is equal to the ith row of 'a' `dot` the jth row of 'b'
matMul :: A.Acc (Matrix Double) -> A.Acc (Matrix Double) -> A.Acc (Matrix Double)
matMul a b' = A.generate (constant \$ Z :. nrows :. ncols) \$
\ix ->
let (Z :. i :. j) = unlift ix
in getRow i a `dotP` getRow j b
where
b = A.transpose b' -- I assume row indexing is faster than column indexing...
(Z :. nrows :.   _  ) = unlift \$ shape a
(Z :.   _   :. ncols) = unlift \$ shape b

This, of course, gives me errors right now because I'm calling getRow and dotP from within the generation function, which expects Exp[ression]s, not Acc[elerated computation]s.

So maybe I need to replace that line with an inner for loop? Is there an easy way to do that with Accelerate?

- Clark

_______________________________________________

_______________________________________________

_______________________________________________
```_______________________________________________