Lance Westerhoff | 26 Oct 17:31 2011

torque/maui disregarding pmem with procs


Hello all-

(I sent this email to the torque list, but I'm wondering if it might be a maui problem).

We are trying to use procs= and pmem= on an 18-node (152-core) cluster with nodes of various memory sizes.
pbsnodes shows the correct memory complement for each node, so apparently PBS is getting the right specs
(see the output of pbsnodes below for more information). If we use the following settings in the PBS
script, torque/maui invariably tries to fill all 8 cores of each node, even though no node has anywhere
near enough memory for 8*3700mb = 29600mb. Considering the physical memory ranges from 8GB to 24GB
depending on the node, this is just taking down nodes left and right.

Below I have provided a small example along with the associated output. I also provided the output for
pbsnodes in case there is something I am missing here.

Thanks for your help!  -Lance

torque version: tried 2.5.4, 2.5.8, and 3.0.2 - all exhibit the same problem.
maui version: 3.2.6p21 (also tried maui 3.3.1, but it fails completely with the procs option: it
requests only a single CPU)

$ cat tmp.pbs
#!/bin/bash
#PBS -S /bin/bash
#PBS -l procs=24
#PBS -l pmem=3700mb
#PBS -l walltime=6:00:00 
#PBS -j oe

cat $PBS_NODEFILE

Gareth.Williams | 27 Oct 01:07 2011

Re: torque/maui disregarding pmem with procs

Hi Lance,

Does maui locate appropriate nodes if you specify:
-l procs=24,vmem=29600mb
?
That's what I'd do. It won't limit the memory per process (loosely speaking), but the main problem is
which nodes are allocated.
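
As a one-liner against your test script, that would be something like this (a sketch; adjust the vmem
figure to the aggregate your job actually needs):

$ qsub -l procs=24,vmem=29600mb tmp.pbs

or, equivalently, in the script itself:

#PBS -l procs=24
#PBS -l vmem=29600mb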

Gareth

> -----Original Message-----
> From: Lance Westerhoff [mailto:lance@quantumbioinc.com]
> Sent: Thursday, 27 October 2011 2:31 AM
> To: mauiusers@supercluster.org
> Subject: [Mauiusers] torque/maui disregarding pmem with procs
> 
> 
> Hello all-
> 
> (I sent this email to the torque list, but I'm wondering if it might be
> a maui problem).
> 
> We are trying to use procs= and pmem= on an 18-node (152-core) cluster
> with nodes of various memory sizes. pbsnodes shows the correct memory
> complement for each node, so apparently PBS is getting the right specs
> (see the output of pbsnodes below for more information). If we use the
> following settings in the PBS script, torque/maui invariably tries to
> fill all 8 cores of each node, even though no node has anywhere near
> enough memory for 8*3700mb = 29600mb. Considering the physical memory
> ranges from 8GB

Lance Westerhoff | 27 Oct 06:15 2011

Re: torque/maui disregarding pmem with procs


Hi Gareth-

The vmem and pvmem options don't seem to make any difference.

Incidentally, since the nodes have different-sized memory complements and I will be distributing
multi-processor jobs across nodes, I absolutely need something that allocates memory on a
per-processor basis. Here's a more detailed rundown of what is available in the cluster:

$ diagnose -n
diagnosing node table (5120 slots)
Name                    State  Procs     Memory         Disk          Swap      Speed  Opsys   Arch Par   Load Res Classes                        Network                        Features              

compute-0-16             Busy   0:8      489:7985        1:1        8341:9985    1.00  linux [NONE] DEF   8.02 005 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]
compute-0-15             Busy   0:8     1185:7985        1:1        9455:9985    1.00  linux [NONE] DEF   8.03 009 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]
compute-0-14             Busy   0:8     1185:7985        1:1        9453:9985    1.00  linux [NONE] DEF   8.00 001 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]
compute-0-13             Busy   0:8     1185:7985        1:1        9494:9985    1.00  linux [NONE] DEF   8.00 001 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]
compute-0-12             Busy   0:8     5213:12013       1:1       13340:14013   1.00  linux [NONE] DEF   8.04 001 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]
compute-0-11             Busy   0:8     5213:12013       1:1       11338:14013   1.00  linux [NONE] DEF   8.01 008 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]
compute-0-9              Busy   0:8     5213:12013       1:1       11402:14013   1.00  linux [NONE] DEF   8.00 008 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]
compute-0-8              Busy   0:8     5213:12013       1:1       11379:14013   1.00  linux [NONE] DEF   8.00 008 [developer_8:8][lowprio_8:8][b [DEFAULT]                      [NONE]
compute-0-7              Busy   0:8     5213:12013       1:1       11391:14013   1.00  linux [NONE] DEF   8.01 008

Ian Miller | 7 Nov 23:44 2011

Strange queue/scheduler issue

Not sure if this is the correct forum for this, but we have a 320-core grid running Maui and Torque.
Three queues are set up: two nodes (24 cores) for one of them, another two exclusively for the second,
and three nodes sharing the default queue. When someone submits, say, 4000 jobs to the default queue,
no one can get jobs to run in either of the other queues; they just sit in Q status. This started about
three days ago and the users are totally in an uproar about it.

Any thoughts on where to find the bottleneck or a config setting would be helpful.

-I 

Ian Miller
System Administrator
ianm@uchicago.edu
312-282-6507

On 10/26/11 6:07 PM, "Gareth.Williams@csiro.au" <Gareth.Williams@csiro.au>
wrote:

>Hi Lance,
>
>Does maui locate appropriate nodes if you specify:
>-l procs=24,vmem=29600mb
>?
>That's what I'd do. It won't limit the memory per process (loosely
>speaking), but the main problem is which nodes are allocated.
>

Roy Dragseth | 7 Nov 23:56 2011

Re: Strange queue/scheduler issue

Have you tried setting a limit on the number of idle jobs allowed per
user?

For instance

USERCFG[DEFAULT] MAXIJOB=10
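
A per-user override takes the same form (the username here is just an example):

USERCFG[jsmith] MAXIJOB=20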

r.

On Monday 7. November 2011 23.44.47 Ian Miller wrote:
> Not sure if this is the correct forum for this, but we have a 320-core
> grid running Maui and Torque. Three queues are set up: two nodes (24
> cores) for one of them, another two exclusively for the second, and
> three nodes sharing the default queue. When someone submits, say, 4000
> jobs to the default queue, no one can get jobs to run in either of the
> other queues; they just sit in Q status. This started about three days
> ago and the users are totally in an uproar about it.
> 
> Any thoughts on where to find the bottleneck or a config setting would
> be helpful.
> 
> -I
> 
> 
> 
> Ian Miller
> System Administrator
> ianm@uchicago.edu
> 312-282-6507

Steve Crusan | 8 Nov 00:02 2011

Re: Strange queue/scheduler issue


On Nov 7, 2011, at 5:56 PM, Roy Dragseth wrote:

> Have you tried setting a limit on the number of idle jobs allowed per
> user?
> 
> For instance
> 
> USERCFG[DEFAULT] MAXIJOB=10
> 

Correct above. Also, you can cap the maximum number of jobs queued via TORQUE; I think the setting is
max_user_queuable.
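
For example, something like this sets it on a queue (a sketch; the queue name and the limit are
illustrative):

$ qmgr -c "set queue batch max_user_queuable = 500"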

I've run into a situation where a user submitted too many jobs for the scheduler to handle (jobs
wouldn't even be given a state), and I had to set something in Maui for the maximum number of jobs.

Try Roy's fix above first, though. The TORQUE limit is great as a hard, easy-to-see user limit, i.e.
users get feedback when they break the rules.

> r.
> 
> 
> On Monday 7. November 2011 23.44.47 Ian Miller wrote:
>> Not sure if this is the correct forum for this, but we have a 320-core
>> grid running Maui and Torque. Three queues are set up: two nodes (24
>> cores) for one of them, another two exclusively for the second, and
>> three nodes sharing the default queue. When someone submits, say, 4000
>> jobs to the default queue, no one can get jobs to run in either of the
>> other queues; they just sit in Q status. This started about three days
>> ago and the users are totally in an uproar about

James A. Peltier | 8 Nov 02:26 2011

Re: Strange queue/scheduler issue

----- Original Message -----
| Not sure if this is the correct forum for this, but we have a 320-core
| grid running Maui and Torque. Three queues are set up: two nodes (24
| cores) for one of them, another two exclusively for the second, and
| three nodes sharing the default queue. When someone submits, say, 4000
| jobs to the default queue, no one can get jobs to run in either of the
| other queues; they just sit in Q status. This started about three days
| ago and the users are totally in an uproar about it.
| 
| Any thoughts on where to find the bottleneck or a config setting would
| be helpful.
| 
| -I

I think you are looking for this.

http://www.adaptivecomputing.com/resources/docs/maui/a.ddevelopment.php

Specifically...

Value  : MMAX_JOB
File   : moab.h
Default: 4096

maximum total number of simultaneous idle/active jobs allowed.
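
Since that's a compile-time constant, raising it means editing the source and rebuilding maui, roughly
like this (a sketch; the source path and the new value are illustrative):

$ cd maui-3.2.6p21
$ vi include/moab.h      # raise "#define MMAX_JOB  4096" to e.g. 8192
$ make && make install   # then restart the maui daemon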


