Gus Correa | 8 Aug 2009 02:58

Re: performance tweaks and optimum memory configs for a Nehalem

Hi Rahul, list

In case you haven't read it, this Nehalem memory guide from Dell
has good information and the memory configuration details:

http://www.delltechcenter.com/page/04-08-2009+-+Nehalem+and+Memory+Configurations

A researcher here bought a Nehalem workstation (not a cluster)
with 24GB RAM also.  We followed the article recommendation,
which was also what the vendor suggested.
Maybe 24GB is more than needed, but presumably avoids the
performance penalty that would hit a 16GB configuration.
Since the computer will mostly run Matlab jobs,
and Matlab has no bounds when it comes to memory,
it may not have been a waste anyway.

Some people are reporting good results when using the
Nehalem hyperthreading feature (activated in the BIOS).
When the code permits, this virtually doubles the number
of cores on Nehalems.
That feature works very well on IBM PPC-6 processors
(IBM calls it "simultaneous multi-threading" SMT, IIRR),
and scales by a factor of >1.5, at least with the atmospheric
model I tried.

This may be a useful way to explore your 24GB, say, by running 12
processes on an 8-core node (50% oversubscribed),
instead of the 8 processes that you run today on the Barcelonas.

As for compiler flags, if you are using Intel these are probably good:
[...]

Rahul Nabar | 9 Aug 2009 01:50

Re: performance tweaks and optimum memory configs for a Nehalem

On Fri, Aug 7, 2009 at 7:58 PM, Gus Correa<gus <at> ldeo.columbia.edu> wrote:
> Some people are reporting good results when using the
> Nehalem hyperthreading feature (activated in the BIOS).
> When the code permits, this virtually doubles the number
> of cores on Nehalems.
> That feature works very well on IBM PPC-6 processors
> (IBM calls it "simultaneous multi-threading" SMT, IIRR),
> and scales by a factor of >1.5, at least with the atmospheric
> model I tried.

Thanks for all the useful comments, Gus!  Hyperthreading is confusing
the hell out of me. I expected to see 8 cores in cat /proc/cpuinfo. Now
I see 16. (This means I must have left hyperthreading on, I guess; I
ought to go to the server room, reboot, and check the BIOS.)

This is confusing my benchmarking too. Let's say I ran an MPI job with
-np 4. If there was no other job on this machine would hyperthreading
bring the other CPUs into play as well?

The reason I ask is this: I have noticed that a single 4-core job is
slower than two 4-core jobs run simultaneously. This seems puzzling to
me.

--

-- 
Rahul
Gus Correa | 10 Aug 2009 04:34

Re: performance tweaks and optimum memory configs for a Nehalem

Hi Rahul, list

See answers inline.

Rahul Nabar wrote:
> On Fri, Aug 7, 2009 at 7:58 PM, Gus Correa<gus <at> ldeo.columbia.edu> wrote:
>> Some people are reporting good results when using the
>> Nehalem hyperthreading feature (activated in the BIOS).
>> When the code permits, this virtually doubles the number
>> of cores on Nehalems.
>> That feature works very well on IBM PPC-6 processors
>> (IBM calls it "simultaneous multi-threading" SMT, IIRR),
>> and scales by a factor of >1.5, at least with the atmospheric
>> model I tried.
> 
> 
> Thanks for all the useful comments, Gus!  Hyperthreading is confusing
> the hell out of me. 

So it is to me.
The good news is that according to all reports I read,
hyperthreading in Nehalem works well
(by contrast with the old version on Pentium-4 and
the corresponding Xeons).

> I expected to see 8 cores in cat /proc/cpuinfo Now
> I see 16. (This means I must have left hyperthreading on I guess; I
> ought to go to the server room; reboot and check the BIOS)
> 


Rahul Nabar | 10 Aug 2009 05:42

Re: performance tweaks and optimum memory configs for a Nehalem

On Sun, Aug 9, 2009 at 9:34 PM, Gus Correa<gus <at> ldeo.columbia.edu> wrote:

> See answers inline.

Thanks!

> So it is to me.
> The good news is that according to all reports I read,
> hyperthreading in Nehalem works well

What I am more concerned about is its implications on benchmarking and
schedulers.

(a) I am seeing strange scaling behaviours with Nehalem cores. E.g., a
specific DFT (Density Functional Theory) code we use is maxing out
performance at 2 or 4 CPUs instead of 8, i.e., runs on 8 cores are
actually slower than on 2 or 4 cores (depending on setup).

It just doesn't make sense to me. We must be doing something wrong.
And no, it isn't just bad parallelization of this code, since we have
run it on AMDs, where performance does increase with the number of cores
on a single server.

(b) We usually set up Torque / PBS / Maui to also allow partial server
requests, i.e., somebody could ask for just 4 cores on a server. The
other four cores could go to another job or stay empty. The question is:
with hyperthreading, this compartmentalization is lost, isn't it? So
userA, who got 4 cores, could end up leeching on the other 4 cores too?
Or am I wrong?


Mikhail Kuzminsky | 11 Aug 2009 19:19

Re: performance tweaks and optimum memory configs for a Nehalem

In message from Rahul Nabar <rpnabar <at> gmail.com> (Sun, 9 Aug 2009 
22:42:25 -0500):
>(a) I am seeing strange scaling behaviours with Nehalem cores. eg A
>specific DFT (Density Functional Theory) code we use is maxing out
>performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are
>actually slower than 2 and 4 cores (depending on setup)

If these results are for HyperThreading "ON", it may not be too strange,
because the "virtual cores" compete with each other.

But if these results are with Hyperthreading switched off, it's
strange.
I usually see good DFT scaling with the number of cores on G03: a speedup
of about 7 on 8 cores.
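
To put a figure like "about 7 times for 8 cores" next to Rahul's results, it can help to compute speedup and parallel efficiency the same way for every run. The following is only a minimal sketch; the wall-clock times are hypothetical numbers chosen to reproduce a speedup of 7, not measurements from G03 or VASP.

```python
def speedup_and_efficiency(t_serial, t_parallel, n_cores):
    """Return (speedup, efficiency) for a run on n_cores.

    speedup    = serial wall time / parallel wall time
    efficiency = speedup / number of cores (1.0 means perfect scaling)
    """
    speedup = t_serial / t_parallel
    return speedup, speedup / n_cores


# Hypothetical timings: 700 s on 1 core, 100 s on 8 cores.
s, e = speedup_and_efficiency(700.0, 100.0, 8)
print(f"speedup = {s:.2f}, efficiency = {e:.1%}")
```

A run whose efficiency falls as cores are added is still scaling; a run whose speedup itself falls between 4 and 8 cores, as Rahul describes, points at something other than ordinary parallel overhead.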

Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry RAS
Moscow
Rahul Nabar | 12 Aug 2009 00:24

Re: performance tweaks and optimum memory configs for a Nehalem

On Tue, Aug 11, 2009 at 12:19 PM, Mikhail Kuzminsky<kus <at> free.net> wrote:
> If these results are for HyperThreading "ON", it may not be too strange,
> because the "virtual cores" compete with each other.
>
> But if these results are with Hyperthreading switched off, it's strange.
> I usually see good DFT scaling with the number of cores on G03: a speedup
> of about 7 on 8 cores.

Yes, it is very strange and I still cannot explain it very well. Do
you have scaling info for VASP? You did mention DFT codes so I was
wondering. I still haven't found much info for VASP on Nehalems. Maybe
it is some feature of this particular code. All the other tests make
sense.

And I just find it hard to believe that the Nehalem, which has so
far been touted as such a good processor, can be outperformed by
one-year-old AMD Barcelonas... I feel it's something I am doing wrong.

--

-- 
Rahul
Mark Hahn | 10 Aug 2009 14:41

Re: performance tweaks and optimum memory configs for a Nehalem

> (a) I am seeing strange scaling behaviours with Nehalem cores. eg A
> specific DFT (Density Functional Theory) code we use is maxing out
> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are
> actually slower than 2 and 4 cores (depending on setup)

this is on the machine which reports 16 cores, right?  I'm guessing
that the kernel is compiled without numa and/or ht, so enumerates 
virtual cpus first.  that would mean that when otherwise idle, a 2-core
proc will get virtual cores within the same physical core.  and that 
your 8c test is merely keeping the first socket busy.
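
The enumeration problem described above can be checked from the sibling lists that Linux exposes per logical CPU. The sketch below is an illustration under assumptions: the two sample layouts are hypothetical, and on a real machine the strings would come from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. It picks one logical CPU per physical core, which is the set a small job should ideally land on.

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-1' or '0,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus


def one_cpu_per_core(sibling_lists):
    """Given {logical cpu: sibling-list string}, keep the lowest-numbered
    hardware thread of every physical core."""
    chosen = set()
    for cpu in sorted(sibling_lists):
        chosen.add(min(parse_cpu_list(sibling_lists[cpu])))
    return sorted(chosen)


# Hypothetical layouts for a machine with 2 cores x 2 hardware threads.
ADJACENT = {0: "0-1", 1: "0-1", 2: "2-3", 3: "2-3"}     # siblings numbered first
INTERLEAVED = {0: "0,2", 1: "1,3", 2: "0,2", 3: "1,3"}  # cores numbered first

print(one_cpu_per_core(ADJACENT))     # logical CPUs 0 and 1 share a core here
print(one_cpu_per_core(INTERLEAVED))  # logical CPUs 0 and 1 are separate cores
```

With the first layout, a 2-process job placed on CPUs 0 and 1 would share a single physical core, which is the situation guessed at above.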

> other four cores could go to another job or stay empty. Question is
> with hyperthreading this compartmentalization is lost isn't it? So
> userA who got 4 cores could end up leeching on the other 4 cores too?
> Or am I wrong?

the kernel/scheduler is smart enough to do mostly the right thing WRT 
virtual cores.  when compiled properly...

>> It is possible that this is the result of not setting
>> processor affinity.
>> The Linux scheduler may not switch processes
>> across cores/processors efficiently.
>
> So let me double check my understanding. On this Nehalem if I set the
> processor affinity is that akin to disabling hyperthreading too? Or
> are these two independent concepts?

processor affinity just means restricting the set of cores a proc 
can run on.  it's orthogonal to the question of choosing the _right_ cores.
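
As a concrete illustration of "restricting the set of cores a proc can run on", here is a minimal sketch using Python's os.sched_setaffinity, which is available on Linux only. The helper name pin_to_cpus is made up for this example, and the sketch says nothing about which cores are the right ones to choose.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to the given logical CPUs and
    return the affinity mask the kernel reports afterwards."""
    os.sched_setaffinity(0, cpus)   # pid 0 means "this process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):   # the affinity API is Linux-specific
    original = os.sched_getaffinity(0)
    first = min(original)
    print("allowed before:", sorted(original))
    print("allowed after: ", sorted(pin_to_cpus({first})))
    os.sched_setaffinity(0, original)  # restore the original mask
```

Hyperthreading stays on or off regardless of the mask; affinity only decides which of the logical CPUs the kernel reports may be used by the process.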

Rahul Nabar | 10 Aug 2009 18:43

Re: performance tweaks and optimum memory configs for a Nehalem

On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn<hahn <at> mcmaster.ca> wrote:
>> (a) I am seeing strange scaling behaviours with Nehalem cores. eg A
>> specific DFT (Density Functional Theory) code we use is maxing out
>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are
>> actually slower than 2 and 4 cores (depending on setup)
>
> this is on the machine which reports 16 cores, right?  I'm guessing
> that the kernel is compiled without numa and/or ht, so enumerates virtual
> cpus first.  that would mean that when otherwise idle, a 2-core
> proc will get virtual cores within the same physical core.  and that your 8c
> test is merely keeping the first socket busy.

No, on both machines: the one reporting 16 cores and the other
reporting 8, i.e., one hyperthreaded and the other not. Both have 8
physical cores.

What is bizarre is that I tried using -np 16. That ought to definitely
utilize all cores, right? I'd have expected the 16-core performance to
be the best. But no, the performance peaks at a smaller number of
cores.

--

-- 
Rahul

Joshua Baker-LePain | 10 Aug 2009 21:09

Re: performance tweaks and optimum memory configs for a Nehalem

On Mon, 10 Aug 2009 at 11:43am, Rahul Nabar wrote

> On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn<hahn <at> mcmaster.ca> wrote:
>>> (a) I am seeing strange scaling behaviours with Nehalem cores. eg A
>>> specific DFT (Density Functional Theory) code we use is maxing out
>>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are
>>> actually slower than 2 and 4 cores (depending on setup)
>>
>> this is on the machine which reports 16 cores, right?  I'm guessing
>> that the kernel is compiled without numa and/or ht, so enumerates virtual
>> cpus first.  that would mean that when otherwise idle, a 2-core
>> proc will get virtual cores within the same physical core.  and that your 8c
>> test is merely keeping the first socket busy.
>
> No. On both machines. The one reporting 16 cores and the other
> reporting 8. i.e. one hyperthreaded and the other not. Both having 8
> physical cores.
>
> What is bizarre is I tried using -np 16. That ought to definitely
> utilize all cores, right? I'd have expected the 16 core performance to
> be the best. But no the performance peaks at a smaller number of
> cores.

Well, as there are only 8 "real" cores, running a computationally 
intensive process across 16 should *definitely* do worse than across 8. 
However, it's not so surprising that you're seeing peak performance with 
2-4 threads.  Nehalem can actually overclock itself when only some of the 
cores are busy -- it's called Turbo Mode.  That *could* be what you're 
seeing.


Tom Elken | 10 Aug 2009 23:07

RE: performance tweaks and optimum memory configs for a Nehalem

> Well, as there are only 8 "real" cores, running a computationally
> intensive process across 16 should *definitely* do worse than across 8.

Not typically.

At the SPEC website there are quite a few SPEC MPI2007 (which is an average across 13 HPC applications)
results on Nehalem.

Summary:
IBM, SGI and Platform have some comparisons on clusters with "SMT On" of running 1 rank for every core
compared to running 2 ranks on every core.  In general, on low core-counts, like up to 32 there is about an 8%
advantage for running 2 ranks per core.  At larger core counts, IBM published a pair of results on 64 cores
where the 64-rank performance was equal to the 128-rank performance.  Not all of these applications scale
linearly, so on some of them you lose efficiency at 128 ranks compared to 64 ranks.

Details: Results from this year are mostly on Nehalem:
http://www.spec.org/mpi2007/results/res2009q3/ (IBM)
http://www.spec.org/mpi2007/results/res2009q2/ (Platform)
http://www.spec.org/mpi2007/results/res2009q1/ (SGI)
  (Intel has results with Turbo mode turned on and off
    in the q2 and q3 results, for a different comparison)

Or you can pick out the Xeon 'X5570' and 'X5560' results from the list of all results:
http://www.spec.org/mpi2007/results/mpi2007.html

In the result index, when 
" Compute Threads Enabled" = 2x "Compute Cores Enabled", then you know SMT is turned on.
In these cases, you can then check that when 
" MPI Ranks" = " Compute Threads Enabled" then you are running 2 ranks per core.

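The two index checks described above are easy to apply mechanically when scanning many results. This is a minimal sketch, assuming the three column values have already been copied out of the SPEC result index by hand; the function name and the sample rows are hypothetical and are not real SPEC submissions.

```python
def classify_result(cores_enabled, threads_enabled, mpi_ranks):
    """Apply the two rules of thumb to one row of the result index."""
    smt_on = threads_enabled == 2 * cores_enabled
    return {
        "smt_on": smt_on,
        "ranks_per_core": mpi_ranks / cores_enabled,
        "two_ranks_per_core": smt_on and mpi_ranks == threads_enabled,
    }


# Hypothetical rows: (Compute Cores, Compute Threads, MPI Ranks)
for row in [(64, 128, 128), (64, 128, 64), (64, 64, 64)]:
    print(row, classify_result(*row))
```
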

Håkon Bugge | 11 Aug 2009 09:43

Re: performance tweaks and optimum memory configs for a Nehalem


On Aug 10, 2009, at 23:07 , Tom Elken wrote:
> Summary:
> IBM, SGI and Platform have some comparisons on clusters with "SMT On" of
> running 1 rank for every core compared to running 2 ranks on every core.
> In general, on low core-counts, like up to 32 there is about an 8%
> advantage for running 2 ranks per core.  At larger core counts, IBM
> published a pair of results on 64 cores where the 64-rank performance was
> equal to the 128-rank performance.  Not all of these applications scale
> linearly, so on some of them you lose efficiency at 128 ranks compared to
> 64 ranks.
>
> Details: Results from this year are mostly on Nehalem:
> http://www.spec.org/mpi2007/results/res2009q3/ (IBM)
> http://www.spec.org/mpi2007/results/res2009q2/ (Platform)
> http://www.spec.org/mpi2007/results/res2009q1/ (SGI)
>  (Intel has results with Turbo mode turned on and off
>    in the q2 and q3 results, for a different comparison)
>
> Or you can pick out the Xeon 'X5570' and 'X5560' results from the list of all results:
> http://www.spec.org/mpi2007/results/mpi2007.html
>
> In the result index, when
> " Compute Threads Enabled" = 2x "Compute Cores Enabled", then you know SMT is turned on.
> In these cases, you can then check that when
> " MPI Ranks" = " Compute Threads Enabled" then you are running 2 ranks per core.

Tom,

Thanks for the neatly compiled information above. I can just add that I
have conducted a fairly detailed analysis of Nehalem compared to HarperTown
in my paper "An evaluation of Intel's core i7 architecture using a
comparative approach", presented at ISC'09:
http://www.springerlink.com/content/b34qn674r0m23228/?p=90e0e6dd92594c7b8b49993c7d245ed7&pi=11
There, I look at different aspects of the memory hierarchy of the two
processors. The benefits from hyperthreading on the said 13 SPEC MPI2007
applications are also studied, although using only a single node, where the
advantage is more pronounced.

Thanks,

 
Håkon



Bill Broadley | 10 Aug 2009 22:22

Re: performance tweaks and optimum memory configs for a Nehalem

Joshua Baker-LePain wrote:

> Well, as there are only 8 "real" cores, running a computationally
> intensive process across 16 should *definitely* do worse than across 8.

I've seen many cases where that isn't true.  The P4 rarely justified turning
on HT because throughput would often be lower.  With the Nehalem it often
helps; the best way to tell is to try it.

> However, it's not so surprising that you're seeing peak performance with
> 2-4 threads.  Nehalem can actually overclock itself when only some of
> the cores are busy -- it's called Turbo Mode.  That *could* be what
> you're seeing.

Indeed.

Craig Tierney | 10 Aug 2009 22:20

Re: performance tweaks and optimum memory configs for a Nehalem

Joshua Baker-LePain wrote:
> On Mon, 10 Aug 2009 at 11:43am, Rahul Nabar wrote
> 
>> On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn<hahn <at> mcmaster.ca> wrote:
>>>> (a) I am seeing strange scaling behaviours with Nehalem cores. eg A
>>>> specific DFT (Density Functional Theory) code we use is maxing out
>>>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are
>>>> actually slower than 2 and 4 cores (depending on setup)
>>>
>>> this is on the machine which reports 16 cores, right?  I'm guessing
>>> that the kernel is compiled without numa and/or ht, so enumerates
>>> virtual
>>> cpus first.  that would mean that when otherwise idle, a 2-core
>>> proc will get virtual cores within the same physical core.  and that
>>> your 8c
>>> test is merely keeping the first socket busy.
>>
>> No. On both machines. The one reporting 16 cores and the other
>> reporting 8. i.e. one hyperthreaded and the other not. Both having 8
>> physical cores.
>>
>> What is bizarre is I tried using -np 16. That ought to definitely
>> utilize all cores, right? I'd have expected the 16 core performance to
>> be the best. But no the performance peaks at a smaller number of
>> cores.
> 
> Well, as there are only 8 "real" cores, running a computationally
> intensive process across 16 should *definitely* do worse than across 8.
> However, it's not so surprising that you're seeing peak performance with
> 2-4 threads.  Nehalem can actually overclock itself when only some of
> the cores are busy -- it's called Turbo Mode.  That *could* be what
> you're seeing.
> 

We are seeing that the chips will overclock themselves even with all cores
running.  The percent increase in speed can be from 2-10% per node.  I have
never had a run (single node HPL) run as slow as it does when Turbo is
turned off.  However, with all the variation per node, there isn't much
of a win for large jobs as they will generally slow down to the slowest node.
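
The last point can be made concrete with a deliberately simple model: assume a tightly coupled job synchronizes every step, so it advances at the pace of its slowest node. That assumption is a simplification, and the per-node gains below are hypothetical values picked from within the 2-10% range mentioned above, not measurements.

```python
def job_speedup(per_node_speedups):
    """Idealized bulk-synchronous model: the whole job runs at the
    speed of the node that gains the least from Turbo."""
    return min(per_node_speedups)


# Hypothetical Turbo gains for four nodes (1.10 means 10% faster).
gains = [1.02, 1.10, 1.06, 1.03]
mean_gain = sum(gains) / len(gains)
print(f"mean per-node gain: {mean_gain:.3f}")
print(f"whole-job gain:     {job_speedup(gains):.3f}")
```

Under this model the more nodes a job spans, the more likely it is to include a node near the low end of the range, which is one way to read the observation that large jobs see little benefit.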

Craig

> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Beowulf mailing list, Beowulf <at> beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

--

-- 
Craig Tierney (craig.tierney <at> noaa.gov)
Rahul Nabar | 10 Aug 2009 22:02

Re: performance tweaks and optimum memory configs for a Nehalem

On Mon, Aug 10, 2009 at 2:09 PM, Joshua Baker-LePain<jlb17 <at> duke.edu> wrote:
> Well, as there are only 8 "real" cores, running a computationally intensive
> process across 16 should *definitely* do worse than across 8. However, it's
> not so surprising that you're seeing peak performance with 2-4 threads.
>  Nehalem can actually overclock itself when only some of the cores are busy
> -- it's called Turbo Mode.  That *could* be what you're seeing.

That could very well be it! Is there any way to test if the CPU has
overclocked itself?

Or can I turn the "turbo mode" off and check?

--

-- 
Rahul

David N. Lombard | 11 Aug 2009 17:40

Re: performance tweaks and optimum memory configs for a Nehalem

On Mon, Aug 10, 2009 at 01:02:51PM -0700, Rahul Nabar wrote:
> On Mon, Aug 10, 2009 at 2:09 PM, Joshua Baker-LePain<jlb17 <at> duke.edu> wrote:
> > Well, as there are only 8 "real" cores, running a computationally intensive
> > process across 16 should *definitely* do worse than across 8.

<YMMV>
Some workloads will benefit materially from SMT, some are neutral, and some
will degrade.  For those that degrade, simply not oversubscribing the physical
cores will get best performance.
</YMMV>

> >                                                               However, it's
> > not so surprising that you're seeing peak performance with 2-4 threads.
> >  Nehalem can actually overclock itself when only some of the cores are busy
> > -- it's called Turbo Mode.  That *could* be what you're seeing.
> 
> That could very well be it! Is there any way to test if the CPU has
> overclocked itself?

There's an application note on the subject at:
<http://download.intel.com/design/processor/applnots/320354.pdf>

Be aware this document is very technical, talking about MSRs & performance counters.
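
For readers who do want to go the counter route, one commonly used approach is sketched below. It is an assumption on the editor's part and is not quoted from the application note: sample the IA32_APERF and IA32_MPERF MSRs twice, then scale the nominal clock by the ratio of their deltas. Reading the MSRs themselves needs root and is left out; the nominal clock and counter deltas here are hypothetical.

```python
def effective_mhz(nominal_mhz, aperf_delta, mperf_delta):
    """Estimate the average clock over a sampling interval.

    APERF advances at the actual core frequency and MPERF at a fixed
    reference frequency, so their ratio scales the nominal clock.
    """
    if mperf_delta <= 0:
        raise ValueError("MPERF did not advance; sample over a longer interval")
    return nominal_mhz * aperf_delta / mperf_delta


# Hypothetical sample: a 2000 MHz part whose APERF ran 10% ahead of MPERF.
print(effective_mhz(2000.0, 1_100_000, 1_000_000), "MHz")
```

A result above the nominal clock while the benchmark runs would suggest Turbo Mode is engaged; comparing the figure for 2, 4 and 8 busy cores would show whether the boost shrinks as more cores are loaded.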

> Or can I turn the "turbo mode" off and check?

That would work, but...  Alternately, take a look at
<http://software.intel.com/en-us/articles/using-enhanced-intel-speedstep-features-in-hpc-clusters/>

--

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.
Joshua Baker-LePain | 10 Aug 2009 22:07

Re: performance tweaks and optimum memory configs for a Nehalem

On Mon, 10 Aug 2009 at 3:02pm, Rahul Nabar wrote

> On Mon, Aug 10, 2009 at 2:09 PM, Joshua Baker-LePain<jlb17 <at> duke.edu> wrote:
>> Well, as there are only 8 "real" cores, running a computationally intensive
>> process across 16 should *definitely* do worse than across 8. However, it's
>> not so surprising that you're seeing peak performance with 2-4 threads.
>>  Nehalem can actually overclock itself when only some of the cores are busy
>> -- it's called Turbo Mode.  That *could* be what you're seeing.
>
> That could very well be it! Is there any way to test if the CPU has
> overclocked itself?
>
> Or can I turn the "turbo mode" off and check?

You *should* be able to turn off turbo mode in the BIOS.

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Gus Correa | 10 Aug 2009 21:40

Re: performance tweaks and optimum memory configs for a Nehalem

Joshua Baker-LePain wrote:
> On Mon, 10 Aug 2009 at 11:43am, Rahul Nabar wrote
> 
>> On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn<hahn <at> mcmaster.ca> wrote:
>>>> (a) I am seeing strange scaling behaviours with Nehalem cores. eg A
>>>> specific DFT (Density Functional Theory) code we use is maxing out
>>>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are
>>>> actually slower than 2 and 4 cores (depending on setup)
>>>
>>> this is on the machine which reports 16 cores, right?  I'm guessing
>>> that the kernel is compiled without numa and/or ht, so enumerates 
>>> virtual
>>> cpus first.  that would mean that when otherwise idle, a 2-core
>>> proc will get virtual cores within the same physical core.  and that 
>>> your 8c
>>> test is merely keeping the first socket busy.
>>
>> No. On both machines. The one reporting 16 cores and the other
>> reporting 8. i.e. one hyperthreaded and the other not. Both having 8
>> physical cores.
>>
>> What is bizarre is I tried using -np 16. That ought to definitely
>> utilize all cores, right? I'd have expected the 16 core performance to
>> be the best. But no the performance peaks at a smaller number of
>> cores.
> 
> Well, as there are only 8 "real" cores, running a computationally 
> intensive process across 16 should *definitely* do worse than across 8. 
> However, it's not so surprising that you're seeing peak performance with 
> 2-4 threads.  Nehalem can actually overclock itself when only some of 
> the cores are busy -- it's called Turbo Mode.  That *could* be what 
> you're seeing.
> 

Hi Rahul, Joshua, list

If Rahul is running these tests with his production jobs,
which he says require 2GB/process, and if he has 24GB/node
(or is it 16GB/node?), then with 16 processes running on a node
memory paging probably kicked in,
because the physical memory is less than 32GB.
Would this be the reason for the drop in performance, Rahul?
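
The arithmetic behind this guess is quick to tabulate. A minimal sketch, assuming 2GB per process as stated above and ignoring whatever the operating system itself needs (so real headroom would be somewhat smaller):

```python
def memory_headroom_gb(ranks, gb_per_rank, node_ram_gb):
    """RAM left over after all ranks have their working set resident.
    A negative value means the node cannot hold the job without paging."""
    return node_ram_gb - ranks * gb_per_rank


for node_ram in (16, 24):
    for ranks in (8, 12, 16):
        headroom = memory_headroom_gb(ranks, 2, node_ram)
        status = "fits" if headroom >= 0 else "likely to page"
        print(f"{node_ram}GB node, {ranks:2d} ranks: {headroom:+d}GB ({status})")
```

With 16 ranks the job needs 32GB, so it overshoots both a 16GB and a 24GB node, while 12 ranks exactly fill 24GB before the OS is accounted for.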

In any case, Joshua is right that you can't expect linear scaling from
8 to 16 processes on a node.
What I saw on an IBM machine
with PPC-6 and SMT (similar to Intel hyperthreading)
was a speedup of around 1.4, rather than 2.
Still a great deal!

If I understand right, hyperthreading opportunistically uses idle
execution units on a core to schedule a second thread to use them.
As clever and efficient as it is, I would guess this mechanism
cannot produce as much work as two physical cores.
There is an article about it in Tom's Hardware:
http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-5.html

My $0.02 of guesses
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Mark Hahn | 10 Aug 2009 19:04

Re: performance tweaks and optimum memory configs for a Nehalem

>> this is on the machine which reports 16 cores, right?  I'm guessing
>> that the kernel is compiled without numa and/or ht, so enumerates virtual
>> cpus first.  that would mean that when otherwise idle, a 2-core
>> proc will get virtual cores within the same physical core.  and that your 8c
>> test is merely keeping the first socket busy.
>
> No. On both machines. The one reporting 16 cores and the other
> reporting 8. i.e. one hyperthreaded and the other not. Both having 8
> physical cores.
>
> What is bizarre is I tried using -np 16. That ought to definitely
> utilize all cores, right? I'd have expected the 16 core performance to
> be the best. But no the performance peaks at a smaller number of
> cores.

I think I would still invoke kernel miscompilation, since if the kernel
isn't aware of the memory/core/socket topology, it probably makes quite 
poor affinity-oblivious allocations.  this is the machine where numactl
doesn't do anything sensible, right?
Rahul Nabar | 10 Aug 2009 05:33

Re: performance tweaks and optimum memory configs for a Nehalem

On Sun, Aug 9, 2009 at 9:34 PM, Gus Correa<gus <at> ldeo.columbia.edu> wrote:
> Most likely it is on.
> Maybe it is the BIOS default, or the vendor set it up this way.
>
> Unfortunately I don't have access to the Nehalem machine.
> So, I can't check the /proc/cpuinfo here, play with MPI, etc.
> I helped a grad student configure it, for his thesis research,
> but the researcher who he works for is a PITA.  Bad politics.

Is there a way of finding out within Linux if Hyperthreading is on or
not? I know there is a BIOS setting but one of the machines I am
testing is remote and I do not have access to BIOS. I'll ask them
though but I am impatient to figure out!

Alternatively /proc/cpuinfo shows a bunch, say 16, cores. Is there a
way to find out if all of these are real cores or hyperthreaded?

--

-- 
Rahul

Mark Hahn | 10 Aug 2009 14:33

Re: performance tweaks and optimum memory configs for a Nehalem

> Is there a way of finding out within Linux if Hyperthreading is on or
> not?

in /proc/cpuinfo, I believe it's as simple as siblings > cpu cores.
that is, I'm guessing one of your nehalems shows as having 8 siblings
and 4 cpu cores.
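
The "siblings > cpu cores" test is short enough to script. A minimal sketch follows; the sample text is a hypothetical excerpt that mirrors the 8-siblings, 4-cores guess above and is not output from Rahul's machine.

```python
def hyperthreading_enabled(cpuinfo_text):
    """Return True if the first processor entry that reports both fields
    has more siblings (logical CPUs per package) than cpu cores."""
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "siblings":
            siblings = int(value)
        elif key == "cpu cores":
            cores = int(value)
        if siblings is not None and cores is not None:
            return siblings > cores
    return False  # fields missing: cannot tell, so report False


# Hypothetical excerpt of one /proc/cpuinfo entry.
SAMPLE = "processor\t: 0\nphysical id\t: 0\nsiblings\t: 8\ncpu cores\t: 4\n"
print(hyperthreading_enabled(SAMPLE))
```

On a live Linux box the same function can be fed the real file, for example hyperthreading_enabled(open("/proc/cpuinfo").read()), which needs no BIOS access.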
Renato Callado Borges | 10 Aug 2009 17:43

Re: performance tweaks and optimum memory configs for a Nehalem

On Mon, Aug 10, 2009 at 08:33:27AM -0400, Mark Hahn wrote:
>> Is there a way of finding out within Linux if Hyperthreading is on or
>> not?
>
> in /proc/cpuinfo, I believe it's as simple as siblings > cpu cores.
> that is, I'm guessing one of your nehalems shows as having 8 siblings
> and 4 cpu cores.

Googling for 'dmidecode Hyper Thread' I found this 2004 article:

http://www.linux.com/archive/articles/41088

And it says:

"I would have liked to just read /proc/cpuinfo to determine if Hyper-Threading is enabled, but currently
that info is not exported to that file. /proc/cpuinfo just displays the number of physical CPUs in the
system and ignores Hyper-Threading.

The process of using x86info is similar to the process of using dmidecode: execute and parse the output. In
this case, x86info will say _The physical package supports 2 logical processors_ if Hyper-Threading is
enabled on a standard Xeon system."

Installed x86info in my box, ran it and (correctly) it says my box' physical package supports 1 logical
processor. (It's a Pentium 4).

--

-- 
[]'s, RCB.
Mark Hahn | 10 Aug 2009 17:51

Re: performance tweaks and optimum memory configs for a Nehalem

> Googling for 'dmidecode Hyper Thread' I found this 2004 article:

the info in /proc/cpuinfo has definitely changed since 2004.
Rahul Nabar | 10 Aug 2009 18:29

Re: performance tweaks and optimum memory configs for a Nehalem

On Mon, Aug 10, 2009 at 7:33 AM, Mark Hahn<hahn <at> mcmaster.ca> wrote:
> in /proc/cpuinfo, I believe it's as simple as siblings > cpu cores.
> that is, I'm guessing one of your nehalems shows as having 8 siblings
> and 4 cpu cores.

Yes. That works. Also looking at the "physical id" helps.

I was confused by the ht flag. Apparently that is not relevant. It
only indicates whether the CPU can report hyperthreading or not.

No wonder all my boxes have that "ht" flag.
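
Looking at "physical id" (together with "core id") also gives the socket and core counts directly. Here is a minimal sketch, assuming each /proc/cpuinfo entry lists "physical id" before "core id"; the sample layout is a hypothetical 2-socket, 2-cores-per-socket, 2-threads-per-core machine built just for the example.

```python
def summarize_topology(cpuinfo_text):
    """Count logical CPUs, physical packages (sockets) and physical cores."""
    logical = 0
    packages = set()
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
            packages.add(value)
        elif key == "core id":
            # A core is unique only in combination with its package.
            cores.add((physical_id, value))
    return {"logical": logical, "packages": len(packages), "cores": len(cores)}


def fake_entry(cpu, package, core):
    """Build one hypothetical /proc/cpuinfo entry."""
    return f"processor\t: {cpu}\nphysical id\t: {package}\ncore id\t: {core}\n\n"


SAMPLE = "".join(fake_entry(c, package=c % 2, core=(c // 2) % 2) for c in range(8))
print(summarize_topology(SAMPLE))
```

When "logical" is twice "cores", every physical core is presenting two hardware threads, independent of whether the "ht" flag appears in the flags line.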

--

-- 
Rahul
