On 2/20/19 12:08 AM, Marcus Wagner wrote:

Hi Prentice,


On 2/19/19 2:58 PM, Prentice Bisbal wrote:

--ntasks-per-node is meant to be used in conjunction with --nodes option. From https://slurm.schedmd.com/sbatch.html:

--ntasks-per-node=<ntasks>
    Request that ntasks be invoked on each node. If used with the
    --ntasks option, the --ntasks option will take precedence
    and the --ntasks-per-node will be treated as a maximum count
    of tasks per node. Meant to be used with the --nodes option...

Yes, but used together with --ntasks it would simply mean at most 48 tasks per node. I don't see where the difference lies as far as submitting the job is concerned. Even if the semantics (how, and how many, cores get scheduled onto which number of hosts) were wrong, the syntax should at least be accepted.

The difference would be in how Slurm treats those specifications internally. To us humans, what you describe seems logical, but if Slurm wasn't programmed to behave that way, it won't. I quoted the documentation because, to me at least, it implies that Slurm isn't programmed to behave like that. Looking at the source code or asking SchedMD could confirm it.


If you don't specify --ntasks, it defaults to --ntasks=1, as Andreas said. From https://slurm.schedmd.com/sbatch.html:

-n, --ntasks=<number>
    sbatch does not launch tasks, it requests an allocation of
    resources and submits a batch script. This option advises the
    Slurm controller that job steps run within the allocation will
    launch a maximum of number tasks and to provide for sufficient
    resources. The default is one task per node, but note that the
    --cpus-per-task option will change this default.
So the correct way to specify your job is either like this:

--ntasks=48

or

--nodes=1 --ntasks-per-node=48

Specifying both --ntasks-per-node and --ntasks at the same time is not correct.
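
As a complete (if minimal) batch script, the second variant would look like this (just a sketch; srun hostname stands in for the real program):

#!/usr/bin/env bash
# one node, with 48 tasks placed on it
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48

srun hostname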

Funnily enough, the result is the same:

$> sbatch -N 1 --ntasks-per-node=48 --wrap hostname
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

whereas using just --ntasks=48, the job gets submitted and scheduled onto one host:

$> sbatch --ntasks=48 --wrap hostname
sbatch: [I] No output file given, set to: output_%j.txt
sbatch: [I] No runtime limit given, set to: 15 minutes
Submitted batch job 199784
$> scontrol show job 199784 | egrep "NumNodes|TRES"
   NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=182400M,node=1,billing=48

To me, this still looks like a bug, not like wrong usage of the submission parameters.

Either a bug, or there's something subtly wrong with your slurm.conf. I would continue troubleshooting by simplifying both your node definition and your SelectType options as much as possible and seeing whether the problem persists. Also, look at 'scontrol show node <node name>' to check whether your definition in slurm.conf lines up with how Slurm actually sees the node. I don't think I've seen that output anywhere in this thread yet.
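
For example (output trimmed to the interesting fields; the values below are illustrative, not taken from your system):

$> scontrol show node ncm0001 | egrep "CoresPerSocket|CPUTot|ThreadsPerCore"
   NodeName=ncm0001 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=48 CPULoad=0.01
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=10541

CPUTot, Sockets, CoresPerSocket, ThreadsPerCore and RealMemory should all agree with slurm.conf and with what 'slurmd -C' prints on the node itself.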


Does no one else use nodes in this shared way?
If nodes are shared, do you schedule by hardware threads or by cores?
If you schedule by cores, how did you implement this in Slurm?


Best
Marcus


Prentice
On 2/14/19 1:09 AM, Henkel, Andreas wrote:
Hi Marcus,

What just came to my mind: if you don't set --ntasks, isn't the default just 1? All the examples I know that use --ntasks-per-node also set --ntasks, with ntasks >= ntasks-per-node.

Best,
Andreas

On 14.02.2019 at 06:33, Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:

Hi all,

I have narrowed this down a little bit.

The really astonishing thing is that if I use

--ntasks=48

I can submit the job, it will be scheduled onto one host:

    NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=48,mem=182400M,node=1,billing=48

but as soon as I change --ntasks to --ntasks-per-node (which should amount to the same thing, since --ntasks=48 gets scheduled onto one host anyway), I get the error:

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available


Is there no one else who observes this behaviour?
Any explanations?


Best
Marcus


On 2/13/19 1:48 PM, Marcus Wagner wrote:
Hi all,

I have a strange behaviour here.
We are using slurm 18.08.5-2 on CentOS 7.6.

Let me first describe our compute nodes:
NodeName=ncm[0001-1032] CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=185000 Feature=skx8160,hostok,hpcwork Weight=10541 State=UNKNOWN

We have the following config set:

$> scontrol show config | grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY,CR_ONE_TASK_PER_CORE
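
Just to spell out the arithmetic implied by the node definition above:

cores per node      = Sockets * CoresPerSocket        = 4 * 12 = 48
hw threads per node = cores per node * ThreadsPerCore = 48 * 2 = 96

i.e. CPUs=48 counts cores, not hardware threads.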


So, I have 48 cores on one node. According to the manpage of sbatch, I should be able to do the following:

#SBATCH --ntasks=48
#SBATCH --ntasks-per-node=48

But I get the following error:
sbatch: error: Batch job submission failed: Requested node configuration is not available


Does anyone have an explanation for this?


Best
Marcus




--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de
