Mark Hahn | 1 Oct 2006 22:58

Re: Has anyone actually seen/used a cell system?

> They have a paper that explains it well and has some
> interesting benchmarks.
>
> http://sc06.supercomputing.org/schedule/pdf/pap225.pdf

this is quite interesting.  I wish they had done benchmarks with doubles,
especially since they alluded to, for instance, the n-body calculation 
really needing at least careful consideration of precision/resolution.
(now that I think of it, using 23 bits of mantissa on a 256^3 FFT
sounds numerically dubious too.)

interesting that for a 2.4GHz Cell, they get at most 10 FP Gflops per SPE.
does anyone have SGEMM numbers for a 3GHz Intel Core2?  I'll guess that
efficiency of libgoto with 2 threads would be >= 80%, so flops would be 
.8*2*8*3 =~ 40 Gflops, or half a Cell chip. 
makes it hard to argue for wide use of Cell, I think...
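
A minimal sketch of the back-of-envelope estimate above, in Python. The 80% efficiency and the 8 single-precision flops per cycle per core are the guesses stated in the message, not measurements. The function name is made up for illustration, and the 80.6 Gflops Cell figure is the 8-SPE SGEMM number quoted from the paper later in this thread.

```python
def estimated_gflops(efficiency, cores, flops_per_cycle, clock_ghz):
    """Sustained Gflops = efficiency * cores * flops/cycle/core * clock in GHz."""
    return efficiency * cores * flops_per_cycle * clock_ghz


# Guessed inputs from the message: 80% efficiency, 2 cores, 8 SP flops/cycle, 3 GHz.
core2_sgemm = estimated_gflops(efficiency=0.8, cores=2, flops_per_cycle=8, clock_ghz=3.0)

# SGEMM figure for 8 SPEs as quoted from the paper elsewhere in this thread.
cell_sgemm = 80.6

ratio = core2_sgemm / cell_sgemm
print(f"estimated Core2 SGEMM: {core2_sgemm:.1f} Gflops")
print(f"fraction of the Cell figure: {ratio:.2f}")
```

The estimate is only as good as the efficiency guess; a measured libgoto run on Core2 would replace the first argument.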

Andrew Shewmaker | 2 Oct 2006 03:40

Re: Has anyone actually seen/used a cell system?

On 10/1/06, Mark Hahn <hahn <at> physics.mcmaster.ca> wrote:
> > They have a paper that explains it well and has some
> > interesting benchmarks.
> >
> > http://sc06.supercomputing.org/schedule/pdf/pap225.pdf
>
> this is quite interesting.  I wish they had done benchmarks with doubles,
> especially since they alluded to, for instance, the n-body calculation
> really needing at least careful consideration of precision/resolution.
> (now that I think of it, using 23 bits of mantissa on a 256^3 FFT
> sounds numerically dubious too.)

Peak double precision performance of Ax=b is 15 GFLOPS on
a 3.2 GHz Cell.

http://icl.cs.utk.edu/iter-ref/custom/index.html?lid=104&slid=210

I can understand why they didn't use doubles in the Sequoia
paper.  DP performance of the architecture is out of their hands
and mentioning it would have distracted from their focus on
programming to the memory hierarchy.

-- 
Andrew Shewmaker

Geoff Jacobs | 1 Oct 2006 23:29

Re: Has anyone actually seen/used a cell system?

Mark Hahn wrote:
>> They have a paper that explains it well and has some
>> interesting benchmarks.
>>
>> http://sc06.supercomputing.org/schedule/pdf/pap225.pdf
> 
> this is quite interesting.  I wish they had done benchmarks with doubles,
> especially since they alluded to, for instance, the n-body calculation
> really needing at least careful consideration of precision/resolution.
> (now that I think of it, using 23 bits of mantissa on a 256^3 FFT sounds
> numerically dubious too.)
> 
> interesting that for a 2.4GHz Cell, they get at most 10 FP Gflops per SPE.
> does anyone have SGEMM numbers for a 3GHz Intel Core2?  I'll guess that
> efficiency of libgoto with 2 threads would be >= 80%, so flops would be
> .8*2*8*3 =~ 40 Gflops, or half a Cell chip. makes it hard to argue for
> wide use of Cell, I think...

Unfortunately, the reality is a little crappier. Sciencemark 2.0 SGEMM
sees 11 gflops on an E6700. DGEMM sees 5-6 gflops.
http://www.pcper.com/article.php?aid=265&type=expert&pid=3

This is an order of magnitude less performance than SGEMM predictions in
the LBL paper. Unfortunately, the LBL numbers are only predictions.
http://www.lbl.gov/Science-Articles/Archive/sabl/2006/Jul/CellProcessorPotential.pdf#search=%22sgemm%20cell%22

The linked article _is_ an evaluation of performance on an actual Cell
chip. Unfortunately, it's a lower clocked pre-production example running
an experimental pseudo-compiler. I'm interested in seeing SGEMM using
Cell-specific intrinsics. Such a benchmark should represent the maximum
practical performance peak.

Andrew Shewmaker | 2 Oct 2006 02:28

Re: Has anyone actually seen/used a cell system?

On 10/1/06, Geoff Jacobs <gdjacobs <at> gmail.com> wrote:

> > interesting that for a 2.4GHz Cell, they get at most 10 FP Gflops per SPE.
> > does anyone have SGEMM numbers for a 3GHz Intel Core2?  I'll guess that
> > efficiency of libgoto with 2 threads would be >= 80%, so flops would be
> > .8*2*8*3 =~ 40 Gflops, or half a Cell chip. makes it hard to argue for
> > wide use of Cell, I think...
>
> Unfortunately, the reality is a little crappier. Sciencemark 2.0 SGEMM
> sees 11 gflops on an E6700. DGEMM sees 5-6 gflops.
> http://www.pcper.com/article.php?aid=265&type=expert&pid=3

The same site reports that the X6800, a 2.93 GHz Core2, sees
almost 12.5 SP GFLOPS using ScienceMark 2.0 (6.2 DP GFLOPS).

http://www.pcper.com/article.php?aid=272&type=expert&pid=5

I don't know much about ScienceMark.  The website has been
replaced with advertisements.  From what I gathered from several
review sites, it is MS Windows only and single threaded.  My
guess is that Goto's implementation would perform significantly
better even with a single thread.  Unfortunately, I looked all over
and couldn't find Core2 benchmarks using Goto's BLAS.

> The linked article _is_ an evaluation of performance on an actual Cell
> chip. Unfortunately, it's a lower clocked pre-production example running
> an experimental pseudo-compiler. I'm interested in seeing SGEMM using
> Cell-specific intrinsics. Such a benchmark should represent the maximum
> practical performance peak.
>

Geoff Jacobs | 2 Oct 2006 03:06

Re: Has anyone actually seen/used a cell system?

Andrew Shewmaker wrote:
> On 10/1/06, Geoff Jacobs <gdjacobs <at> gmail.com> wrote:
> 
>> > interesting that for a 2.4GHz Cell, they get at most 10 FP Gflops
>> per SPE.
>> > does anyone have SGEMM numbers for a 3GHz Intel Core2?  I'll guess that
>> > efficiency of libgoto with 2 threads would be >= 80%, so flops would be
>> > .8*2*8*3 =~ 40 Gflops, or half a Cell chip. makes it hard to argue for
>> > wide use of Cell, I think...
>>
>> Unfortunately, the reality is a little crappier. Sciencemark 2.0 SGEMM
>> sees 11 gflops on an E6700. DGEMM sees 5-6 gflops.
>> http://www.pcper.com/article.php?aid=265&type=expert&pid=3
> 
> The same site reports that the X6800, a 2.93 GHz Core2, sees
> almost 12.5 SP GFLOPS using ScienceMark 2.0 (6.2 DP GFLOPS).
> 
> http://www.pcper.com/article.php?aid=272&type=expert&pid=5
> 
> I don't know much about ScienceMark.  The website has been
> replaced with advertisements.  From what I gathered from several
> review sites, it is MS Windows only and single threaded.  My
> guess is that Goto's implementation would perform significantly
> better even with a single thread.  Unfortunately, I looked all over
> and couldn't find Core2 benchmarks using Goto's BLAS.

WRT whether ScienceMark's SGEMM is multithreaded, I think you're right. The
primary web site is German, and explicitly states that the BLAS tests are
not SMP. My bad :|
http://www.sciencemark.de/faq.html

Mark Hahn | 2 Oct 2006 05:24

Re: Has anyone actually seen/used a cell system?

>> The same site reports that the X6800, a 2.93 GHz Core2, sees
>> almost 12.5 SP GFLOPS using ScienceMark 2.0 (6.2 DP GFLOPS).

hmm, those numbers are pretty low - peak should be 2.93*4 (DP) or 2.93*8 (SP)
Gflops/core, and I'd expect 80% of peak or 19 Gflops/core for this comparison
(Opterons can do 90%, at least on my machine using HPL.)

so the paper shows 80.6 Gflops SGEMM for 8 SPEs; it's only
fair to compare this to 2 or 4 Core2 cores (37.5 and 75 Gflops!)
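
The arithmetic behind those figures can be sketched as follows. This is a rough check rather than a benchmark: the 8 single-precision flops per cycle and the 80% efficiency are assumptions carried over from the estimate above, and the variable names are illustrative.

```python
CLOCK_GHZ = 2.93         # X6800 clock from the quoted ScienceMark result
SP_FLOPS_PER_CYCLE = 8   # assumed single-precision peak per core per cycle
EFFICIENCY = 0.8         # assumed fraction of peak for a well-tuned SGEMM
CELL_SGEMM = 80.6        # Gflops for 8 SPEs, as reported in the paper

peak_per_core = CLOCK_GHZ * SP_FLOPS_PER_CYCLE
expected_per_core = EFFICIENCY * peak_per_core

print(f"peak per core:     {peak_per_core:.2f} Gflops")
print(f"expected per core: {expected_per_core:.2f} Gflops")
for cores in (2, 4):
    total = expected_per_core * cores
    print(f"{cores} cores: {total:.1f} Gflops, Cell/Core2 = {CELL_SGEMM / total:.2f}")
```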

> indicative of per core performance on Core 2. Is it safe to say that
> Core 2 achieves <15 gflops/core at 3ghz, assuming ~15% premium with Goto
> BLAS?

peak SGEMM/core would be 3*8=24, so 15 sounds quite low.

>> It looks like a preproduction 2.4 GHz Cell is 2-6 times faster than a

do you know of something crippled in the pre-production Cell chips?

it looks like 2x is about right to me, considering that full-production
Cell appears to ship about the same time as 4x Core2.  the main question
is whether that's good enough to make Cell more than a niche product.
I've talked with a number of my better users, and they all tend to want
at least a 10x speedup before considering non-GP approaches (cell, fpga, gpgpu).

> I guess my biggest objection to Mark's comment was the comparison of
> SGEMM implemented in an experimental language with unproven structure
> with a theoretical calculation of Core 2 peak performance. I'd simply


Geoff Jacobs | 2 Oct 2006 08:17

Re: Has anyone actually seen/used a cell system?

Mark Hahn wrote:
>>> The same site reports that the X6800, a 2.93 GHz Core2, sees
>>> almost 12.5 SP GFLOPS using ScienceMark 2.0 (6.2 DP GFLOPS).
> 
> hmm, those numbers are pretty low - peak should be 2.93*4 or 8,
> and I'd expect 80% of peak or 19 Gflops/core for this comparison
> (Opterons can do 90%, at least on my machine using HPL.)

I've consulted some other sources just to make sure I get this
right. We can't naively say that Core 2 maxes out at clock*4 or clock*8
for theoretical peak flops. Port 1 on the FPU can handle 4xSP flops, but
only simple operations like FPADD. Port 2 can handle FPMUL and FPDIV
(therefore FPADD as well) on a 4xSP vector.

So, there is a hard floor on theoretical Core 2 floating point
performance of clock*4 flops (for pure FPMUL and FPDIV), and a hard
ceiling of clock*8 flops (for a mix where FPADD is >=50%). Looking at
the source code, SGEMM is a FPMUL bruiser, which puts peak performance
closer to the floor than the ceiling. 12.5 gflops looks like an accurate
number for Core 2 SGEMM.
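
The floor and ceiling above can be written down as a small model. This is only a sketch of the two-port picture as described in this message (one port limited to adds, the other taking multiplies, divides and adds, each on a 4-wide SP vector); it is not a statement about how the hardware actually schedules, and the function name is hypothetical.

```python
def peak_flops_per_cycle(add_fraction, width=4):
    """Peak flops/cycle under the two-port model described in the message.

    add_fraction is the share of floating-point operations that are adds;
    the remainder are multiplies or divides, which only one port can issue.
    """
    if not 0.0 <= add_fraction <= 1.0:
        raise ValueError("add_fraction must be between 0 and 1")
    mul_fraction = 1.0 - add_fraction
    if mul_fraction <= 0.5:
        # Enough adds to keep both ports busy every cycle.
        return 2.0 * width
    # The multiply-capable port is the bottleneck; adds ride along for free.
    return width / mul_fraction


CLOCK_GHZ = 2.93
floor_gflops = CLOCK_GHZ * peak_flops_per_cycle(0.0)
ceiling_gflops = CLOCK_GHZ * peak_flops_per_cycle(0.5)
print(f"floor:   {floor_gflops:.2f} Gflops")
print(f"ceiling: {ceiling_gflops:.2f} Gflops")
```

Under this model the per-core bound depends entirely on the add/multiply mix assumed for the kernel, which is the point in dispute here.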

> so the paper shows 80.6 Gflops SGEMM for 8 SPE's; it's only fair to
> compare this to 2 or 4 Core2 cores (37.5 and 75 Gflops!)

Going by die size, Cell would compare with a hypothetical 3-core Core 2
CPU. (Cell is apparently ~220mm^2, Core 2 Duo ~140mm^2)
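
A quick sketch of that die-size comparison. It assumes the approximate areas given above and that the Core 2 Duo die (shared cache included) splits evenly between its two cores, which is a simplification.

```python
CELL_DIE_MM2 = 220.0        # approximate, from the message
CORE2_DUO_DIE_MM2 = 140.0   # approximate, two cores per die

# Simplifying assumption: charge half of the whole die to each core.
area_per_core = CORE2_DUO_DIE_MM2 / 2
equivalent_cores = CELL_DIE_MM2 / area_per_core
print(f"Cell die area is worth about {equivalent_cores:.1f} Core 2 cores")
```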

>> indicative of per core performance on Core 2. Is it safe to say that
>> Core 2 achieves <15 gflops/core at 3ghz, assuming ~15% premium with Goto
>> BLAS?
> 
> peak SGEMM/core would be 3*8=24, so 15 sounds quite low.

Andrew Shewmaker | 2 Oct 2006 03:23

Re: Has anyone actually seen/used a cell system?

On 10/1/06, Andrew Shewmaker <agshew <at> gmail.com> wrote:

> It looks like a preproduction 2.4 GHz Cell is 2-6 times faster than a 2.93 GHz
> Core2 at SGEMM.  That's an awfully big range, so hopefully someone
> will be kind enough to benchmark libgoto on Core2 for us.  The history file
> indicates that libgoto is optimized for Core2, but I don't have one to test.

I apologize for replying to my own message, but the 2-6 times faster isn't a
good range since it assumes only one of the Core2 cores is used for the
upper bound (80/12.5).  Assuming that ScienceMark's BLAS scaled
perfectly across two cores, the upper bound would be about 3.

So, it looks like a preproduction 2.4 GHz Cell is about 2-3 times faster than a
2.93 GHz Core2 at SGEMM.

However, IBM intends to scale production Cells to 3.2 GHz (let's assume a
1.3x speedup).  And Intel intends to double their cores again, and we expect
them to lower the clock of those cores too.  Anandtech thinks 2.66GHz
is the fastest we'll see.

http://www.anandtech.com/mac/showdoc.aspx?i=2832&p=6

So, that might give us a 2.66/2.93*2 = 1.8x speedup for SGEMM on Intel's quad
core.  The Cell may only be 1.4-2.3 times faster at SGEMM than an Intel solution by
Q1 2007.  Most people I know would love to have that kind of speedup if it didn't
take too much effort.  Sequoia looks like it might make the level of effort
reasonable.
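
For anyone who wants to rerun the numbers, here is a minimal sketch of the ranges above. The lower bound of 2 is assumed to come from the 40 Gflops two-core estimate earlier in the thread; the 1.3x Cell scaling and the 2.66 GHz quad-core clock are the guesses stated in this message, and all names are illustrative.

```python
CELL_PREPROD = 80.0     # Gflops SGEMM, 8 SPEs at 2.4 GHz (rounded from the paper)
CORE2_ESTIMATE = 40.0   # earlier two-core libgoto estimate (assumed source of "2x")
CORE2_MEASURED = 12.5   # ScienceMark 2.0, one core of a 2.93 GHz X6800

# Today: preproduction Cell against a dual-core Core2.
today_low = CELL_PREPROD / CORE2_ESTIMATE
today_high = CELL_PREPROD / (2 * CORE2_MEASURED)  # perfect two-core scaling assumed

# Projection: production Cell at 3.2 GHz against a 2.66 GHz quad core.
cell_scale = 1.3
quad_scale = 2.66 / 2.93 * 2

future_low = today_low * cell_scale / quad_scale
future_high = today_high * cell_scale / quad_scale

print(f"today:     {today_low:.1f}x to {today_high:.1f}x")
print(f"projected: {future_low:.1f}x to {future_high:.1f}x")
```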

FYI, Charm++ is also working on the difficulty of Cell programming.
