Hi Alexander,
could you please run
scontrol show config | grep SelectTypeParameters
and tell us the result?
In fact, for SLURM a CPU is always a CPU, regardless of whether that
means a thread (with HT) or a core (without HT).
The real question is rather why SLURM thinks such a node is not available.
We sometimes see this phenomenon too, and we have to restart the
Slurm controller (slurmctld) to resolve it.
But I would first like to see what
sbatch -vvv jobscript
outputs. I'm not sure whether it will be meaningful if the job does
not get submitted, but it is worth a try.
Best
Marcus
On 3/4/20 1:25 PM, Alexander Grund wrote:
> What is your hardware configuration? Do you have 1 server with 44
> processor sockets, and each processor has 4 CPU cores? Or is it maybe
> 1 server with 1 or more sockets for a total of 44 CPU cores, and each
> CPU core is running 4 hyperthreads?
1 server, 2 sockets, 22 cores each, 4 hyperthreads per core -->
2*22*4=176, which is the "CPUTot" reported by "scontrol show node"
> I think you should give the relevant node and partition lines from
> your slurm.conf.
I found the following in node.conf:
NodeName=taurusml[1-32] Feature=IB Gres=gpu:6 Procs=176 Sockets=2 CoresPerSocket=22 ThreadsPerCore=4 RealMemory=254000 State=UNKNOWN Weight=128
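The relation between the fields of that node definition can be checked with a few lines of Python. This is only a sketch and not a Slurm tool: parse_node_line is a hypothetical helper, and it assumes the definition is a single whitespace-separated line of Key=Value pairs, as quoted above.

```python
# Node definition quoted from node.conf above, joined onto one line.
NODE_LINE = (
    "NodeName=taurusml[1-32] Feature=IB Gres=gpu:6 Procs=176 Sockets=2 "
    "CoresPerSocket=22 ThreadsPerCore=4 RealMemory=254000 State=UNKNOWN Weight=128"
)


def parse_node_line(line):
    """Hypothetical helper: split 'Key=Value Key=Value ...' into a dict of strings."""
    return dict(field.split("=", 1) for field in line.split())


node = parse_node_line(NODE_LINE)
sockets = int(node["Sockets"])
cores_per_socket = int(node["CoresPerSocket"])
threads_per_core = int(node["ThreadsPerCore"])

# Physical cores and hardware threads are two different counts on this node.
physical_cores = sockets * cores_per_socket
hardware_threads = physical_cores * threads_per_core

print("physical cores:  ", physical_cores)
print("hardware threads:", hardware_threads)
print("matches Procs:   ", hardware_threads == int(node["Procs"]))
```

So the configured Procs value corresponds to the number of hardware threads, not the number of physical cores.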
> Which Slurm version do you run?
19.05.5
> The whypending tool does not appear in a google search. Where did
> you get it from and what does it do?
It seems to be a Python script showing why a job is pending. It uses
pyslurm. I thought it was a Slurm tool, but it might be some custom thing.
> > Most importantly: Does this mean `--cpus-per-task` can be as high
> > as 176 on this node and `--mem-per-cpu` can be up to the reported
> > "RealMemory"/176?
> Yes.
> This is just historical as far as I can tell. I think 'CPU' almost
> always means 'core'.
I just tried a very simple example with 1 task and
`--cpus-per-task=50` (slightly higher than the 44 physical cores) and
it failed with "Requested node configuration is not available".
So in summary: "CPU" for srun/sbatch/salloc means "(physical)
core", while "CPU" for scontrol (and pyslurm, which seems to wrap it)
means "thread". This is confusing, but at least the question seems to
be answered now.
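To make the two readings of "CPU" concrete for the `--mem-per-cpu` question, here is a small arithmetic sketch. It uses only the numbers quoted in this thread (RealMemory is given in MB); mem_per_cpu_mb is a hypothetical helper, and which divisor actually applies on a given cluster is an assumption that depends on its configuration.

```python
REAL_MEMORY_MB = 254000   # RealMemory from the node definition, in MB
HARDWARE_THREADS = 176    # CPUTot: 2 sockets * 22 cores * 4 threads
PHYSICAL_CORES = 44       # 2 sockets * 22 cores


def mem_per_cpu_mb(real_memory_mb, cpu_count):
    """Hypothetical helper: largest whole-MB share per CPU that still fits the node."""
    return real_memory_mb // cpu_count


# If "CPU" means hardware thread (the scontrol view):
per_thread = mem_per_cpu_mb(REAL_MEMORY_MB, HARDWARE_THREADS)
# If "CPU" means physical core (what the failed 50-CPU request suggests):
per_core = mem_per_cpu_mb(REAL_MEMORY_MB, PHYSICAL_CORES)

print("MB per CPU if CPU = thread:", per_thread)
print("MB per CPU if CPU = core:  ", per_core)
```

The two results differ by a factor of four, so it matters which meaning of "CPU" the scheduler applies when such a limit is derived.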
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de