Hi Diego,
sorry for the delay.
On 10/18/21 14:20, Diego Zuccato wrote:
On 15/10/2021 06:02, Marcus Wagner wrote:
Mostly, our problem was that we forgot to add a node to, or remove it
from, the partitions/topology file, which caused slurmctld to refuse to
start. So I wrote a simple checker for that. Here is the output of a
sample run:
Even "just" catching syntax errors and the most common mistakes is
already a big help, especially for noobs :)
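The node/partition/topology consistency check mentioned above could be sketched in Python roughly as follows. This is not the checker from this thread; the function names are made up for illustration, the hostlist expansion handles only one bracket group per name, and keywords are assumed to be written with their usual capitalization.

```python
import re


def expand_hostlist(expr):
    """Expand a simple hostlist such as 'lns[07-08],old01' into single names.

    Assumption: at most one bracket group per name, holding ranges or lists.
    """
    names = []
    # Split on commas that are outside of brackets.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?[^,]*", expr):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if not match:
            names.append(part)
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            low, _, high = item.partition("-")
            high = high or low
            width = len(low)  # keep zero padding, e.g. 07 -> "07"
            for number in range(int(low), int(high) + 1):
                names.append(f"{prefix}{number:0{width}d}{suffix}")
    return names


def collect(config_text, keyword, field):
    """Gather host names from 'field=' on every line that starts with 'keyword='."""
    hosts = set()
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line.startswith(keyword + "="):
            continue
        for token in line.split():
            key, _, value = token.partition("=")
            if key == field:
                hosts.update(expand_hostlist(value))
    return hosts


def check_membership(slurm_conf, topology_conf):
    """Report nodes that are missing from, or unknown to, partitions and topology."""
    defined = collect(slurm_conf, "NodeName", "NodeName")
    defined.discard("DEFAULT")  # NodeName=DEFAULT only sets default values
    in_partitions = collect(slurm_conf, "PartitionName", "Nodes")
    if "ALL" in in_partitions:  # Nodes=ALL covers every defined node
        in_partitions = set(defined)
    in_topology = collect(topology_conf, "SwitchName", "Nodes")
    return {
        "not in any partition": sorted(defined - in_partitions),
        "partition references unknown node": sorted(in_partitions - defined),
        "not in topology": sorted(defined - in_topology),
        "topology references unknown node": sorted(in_topology - defined),
    }
```

Called with the contents of slurm.conf and topology.conf as strings, `check_membership` returns four lists; any non-empty list points at a node that was added to or removed from one file but not the other.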
[OK]: All nodeweights are correct.
What do you mean by this? How can weights be "incorrect"?
We use node weights calculated from several factors, such as CPU
generation, memory, cores and available generic resources.
For example, some of our nodes have additional NVMe disks. These should
be scheduled later than the nodes without NVMe, but can be selected
explicitly by requesting the constraint nvme.
My checker calculates these weights, so I do not have to work them out
myself; I just insert the calculated value.
Example output (instead of "[OK]: All nodeweights are correct."):
NodeName=lns[07-08] Sockets=8
CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1020000
Feature=broadwell,bwx8860,nvme,hostok,hpcwork Gres=gpu:pascal:1
Weight=111544(was 1) State=UNKNOWN
So the correct weight is 111544, but I had set it to "1" in the config
file. With "Weight=111544(was 1)" the checker tells me that the correct
value for this kind of node is 111544, not "1".
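A minimal Python sketch of this kind of comparison follows. The weight formula, the score tables and all function names are assumptions made for illustration; the real formula is not given in this thread, so the sketch does not reproduce the value 111544. Only the "Key=Value" layout of the NodeName line is taken from the example above.

```python
import re

# Hypothetical scores; the real factors and their scaling are not shown here.
CPU_GENERATION_SCORE = {"broadwell": 1, "skylake": 2}
EXTRA_FEATURE_PENALTY = {"nvme": 1000}
GRES_PENALTY = 500


def parse_node_line(line):
    """Split a 'Key=Value Key=Value ...' node definition into a dict."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


def expected_weight(node):
    """Toy formula: newer, larger or better-equipped nodes get a higher weight."""
    features = node.get("Feature", "").split(",")
    cores = int(node["Sockets"]) * int(node["CoresPerSocket"])
    memory_gb = int(node["RealMemory"]) // 1000
    generation = max((CPU_GENERATION_SCORE.get(f, 0) for f in features), default=0)
    penalty = sum(EXTRA_FEATURE_PENALTY.get(f, 0) for f in features)
    gres = GRES_PENALTY if node.get("Gres") else 0
    return generation * 10000 + cores + memory_gb + penalty + gres


def check_weight(line):
    """Return the line unchanged if its Weight matches, else mark it '(was N)'.

    Assumption: the line carries an explicit Weight= entry.
    """
    node = parse_node_line(line)
    configured = int(node["Weight"])
    expected = expected_weight(node)
    if configured == expected:
        return line
    return re.sub(r"Weight=\d+", f"Weight={expected}(was {configured})", line)
```

Feeding each NodeName line of the config through `check_weight` and printing only the lines that changed gives a report in the same style as the example output.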
Best
Marcus
If someone is interested ...
Surely I am :)
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de