Hi Alexander,

Could you please run

scontrol show config | grep SelectTypeParameters

and tell us the result?
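
If the value needs to be collected on several clusters, the same lookup can be scripted. The following is only a sketch: it assumes that `scontrol show config` prints one `Key = Value` pair per line, and the helper names `config_value` and `select_type_parameters` are made up for illustration. Unlike the grep above, it matches the key exactly rather than as a substring.

```python
import subprocess


def config_value(config_text, key):
    """Return the value for `key` from 'Key = Value' lines, or None if absent.

    Assumption: each setting is printed on its own line as 'Key = Value'.
    """
    for line in config_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return None


def select_type_parameters():
    """Run `scontrol show config` and extract SelectTypeParameters.

    Needs a working Slurm client on the machine where it is called.
    """
    result = subprocess.run(
        ["scontrol", "show", "config"],
        capture_output=True,
        text=True,
        check=True,
    )
    return config_value(result.stdout, "SelectTypeParameters")
```

Calling `select_type_parameters()` on a login node should return the same value that the grep shows, or `None` if the key is not present in the output.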


In fact, for SLURM a CPU is always a CPU, regardless of whether that means a thread (with HT) or a core (without HT).
The real question is why SLURM thinks such a node is not available.
We sometimes see this phenomenon too, and we have to restart the Slurm controller (slurmctld) to resolve it.

But first I would like to see what

sbatch -vvv jobscript

outputs. I'm not sure whether the output will be meaningful if the job does not get submitted, but it might be worth a try.


Best
Marcus


On 3/4/20 1:25 PM, Alexander Grund wrote:
> What is your hardware configuration?  Do you have 1 server with 44 processor sockets, and each processor has 4 CPU cores?  Or is it maybe 1 server with 1 or more sockets for a total of 44 CPU cores, and each CPU core is running 4 hyperthreads?

1 server, 2 sockets, 22 cores each, 4 hyperthreads --> 2*22*4=176 "CPUTot" as reported by "scontrol show node"

> I think you should give the relevant node and partition lines from your slurm.conf.

I found the following in node.conf: NodeName=taurusml[1-32] Feature=IB Gres=gpu:6 Procs=176 Sockets=2 CoresPerSocket=22 ThreadsPerCore=4 RealMemory=254000 State=UNKNOWN Weight=128
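
As a cross-check of the numbers above, here is a minimal sketch that parses this node line and derives the physical core count and the hardware thread count from Sockets, CoresPerSocket and ThreadsPerCore. The helper names `parse_node_line` and `cpu_counts` are hypothetical, and the parsing assumes simple whitespace-separated `Key=Value` pairs as in the line quoted above.

```python
import re

# Node definition as quoted from node.conf above.
NODE_LINE = (
    "NodeName=taurusml[1-32] Feature=IB Gres=gpu:6 Procs=176 Sockets=2 "
    "CoresPerSocket=22 ThreadsPerCore=4 RealMemory=254000 State=UNKNOWN Weight=128"
)


def parse_node_line(line):
    """Split a node line into a dict of its Key=Value pairs (values stay strings)."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def cpu_counts(line):
    """Return (physical_cores, hardware_threads) for one node."""
    fields = parse_node_line(line)
    cores = int(fields["Sockets"]) * int(fields["CoresPerSocket"])
    threads = cores * int(fields["ThreadsPerCore"])
    return cores, threads


cores, threads = cpu_counts(NODE_LINE)
print(cores, threads)  # 44 176
```

The thread count matches both Procs=176 in the node line and the "CPUTot" value reported by "scontrol show node", while the core count is the 44 physical cores mentioned further down.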

> Which Slurm version do you run?

19.05.5

> The whypending tool does not appear in a google search. Where did you get it from and what does it do?

It seems to be a Python script that shows why a job is pending. It uses pyslurm. I thought it was a Slurm tool, but it might be some custom thing.

> >Most importantly: Does this mean `--cpus-per-task` can be as high as 176 on this node and `--mem-per-cpu` can be up to the reported "RealMemory"/176?
> Yes.

> This is just historical as far as I can tell. I think 'CPU' almost always means 'core'.

I just tried a very simple example with 1 task and `--cpus-per-task=50` (slightly higher than the 44 physical cores) and it failed with "Requested node configuration is not available"


So in summary: "CPU" for srun/sbatch/salloc means "(physical) core", while "CPU" for scontrol (and pyslurm, which seems to wrap it) means "thread". This is confusing, but at least the question seems to be answered now.


--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

