Hi Diego,

Sorry for the delay.


On 10/18/21 14:20, Diego Zuccato wrote:
On 15/10/2021 06:02, Marcus Wagner wrote:

Mostly, our problem was that we forgot to add/remove a node to/from the partitions/topology file, which caused slurmctld to refuse to start. So I wrote a simple checker for that. Here is the output of a sample run:
Even "just" catching syntax errors and the most common mistakes is already a big help, especially for noobs :)
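
The check described above (a node present in one place but missing in another) could be sketched roughly as follows. This is not the actual checker, only a minimal illustration: the node lnx01, the partition name demo and the switch name sw1 are made up, and expand_hostlist handles just one bracket group per name, which is far less than Slurm's real hostlist syntax.

```python
import re

def expand_hostlist(expr):
    """Expand a simple host expression such as 'lns[07-08]' (one bracket group only)."""
    m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if not m:
        return [expr]
    prefix, body, suffix = m.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # keep zero padding, e.g. 07 -> 08
            for i in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{i:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts

def split_hosts(value):
    """Split a comma-separated host list, ignoring commas inside brackets."""
    parts, depth, cur = [], 0, ""
    for ch in value:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(cur)
            cur = ""
        else:
            cur += ch
    if cur:
        parts.append(cur)
    return parts

def nodes_in(lines, key):
    """Collect every host named by 'key=...' on the given config lines."""
    found = set()
    for line in lines:
        m = re.search(rf"\b{key}=(\S+)", line)
        if m:
            for expr in split_hosts(m.group(1)):
                found.update(expand_hostlist(expr))
    return found

# Made-up config fragments, for illustration only.
slurm_conf = [
    "NodeName=lns[07-08] Sockets=8 CoresPerSocket=18 Weight=1 State=UNKNOWN",
    "NodeName=lnx01 Sockets=2 CoresPerSocket=12 State=UNKNOWN",
    "PartitionName=demo Nodes=lns[07-08] Default=YES",
]
topology_conf = ["SwitchName=sw1 Nodes=lns07,lnx01"]

defined = nodes_in(slurm_conf, "NodeName")
partition_lines = [l for l in slurm_conf if l.startswith("PartitionName=")]
in_partitions = nodes_in(partition_lines, "Nodes")
in_topology = nodes_in(topology_conf, "Nodes")

missing_from_partitions = sorted(defined - in_partitions)
missing_from_topology = sorted(defined - in_topology)
unknown = sorted((in_partitions | in_topology) - defined)

print("not in any partition:", missing_from_partitions)
print("not in topology:     ", missing_from_topology)
print("referenced but undefined:", unknown)
```

The sketch reports mismatches in both directions and leaves it to the admin to decide which of them actually matter for a given setup.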

[OK]: All nodeweights are correct.
What do you mean by this? How can weights be "incorrect"?

We use node weights calculated from several factors, such as CPU generation, memory, cores and available generic resources. For example, some of our nodes have additional NVMe disks; these should be scheduled later than the nodes without NVMe disks, but can be requested explicitly with the constraint nvme. My checker calculates these weights, so I do not have to work them out myself; I just insert the calculated value.
Example output (instead of "[OK]: All nodeweights are correct."):
NodeName=lns[07-08] Sockets=8 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1020000 Feature=broadwell,bwx8860,nvme,hostok,hpcwork Gres=gpu:pascal:1 Weight=111544(was 1) State=UNKNOWN

So the correct weight is 111544, but I set it to "1" in the config file. With "Weight=111544(was 1)" the checker tells me that the correct value for this kind of node is 111544 and not "1".
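
To illustrate the idea, here is a rough sketch of how such a weight check might look. The ranking table, the multipliers and the way the factors are combined are all invented for this example; it will not reproduce the value 111544 from our real configuration, and the function names are hypothetical.

```python
import re

# Hypothetical tables; the real factors and their scaling are site-specific.
CPU_GENERATION_RANK = {"broadwell": 1, "skylake": 2}
LATE_FEATURES = {"nvme"}  # features that should push a node later in scheduling

def parse_node_line(line):
    """Split a NodeName line into a key -> value dict."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

def expected_weight(node):
    """Combine the factors into one number (larger means scheduled later)."""
    features = node.get("Feature", "").split(",")
    cpu_rank = max((CPU_GENERATION_RANK.get(f, 0) for f in features), default=0)
    cores = int(node["Sockets"]) * int(node["CoresPerSocket"])
    mem_gb = int(node["RealMemory"]) // 1000
    has_gres = 1 if node.get("Gres") else 0
    late = 1 if LATE_FEATURES & set(features) else 0
    # Illustrative combination only, not the formula used by the real checker.
    return cpu_rank * 100000 + late * 10000 + has_gres * 5000 + mem_gb + cores

def check_line(line):
    """Return the line annotated with the expected weight, or None if it matches."""
    node = parse_node_line(line)
    want = expected_weight(node)
    have = int(node.get("Weight", 1))  # Slurm's default weight is 1
    if want == have:
        return None
    return re.sub(r"Weight=\d+", f"Weight={want}(was {have})", line)

line = ("NodeName=lns[07-08] Sockets=8 CoresPerSocket=18 ThreadsPerCore=1 "
        "RealMemory=1020000 Feature=broadwell,bwx8860,nvme,hostok,hpcwork "
        "Gres=gpu:pascal:1 Weight=1 State=UNKNOWN")
report = check_line(line)
print(report if report else "[OK]: All nodeweights are correct.")
```

Keeping the calculation in one function means the weighting policy lives in a single place, and the config file only has to carry the resulting number.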

Best
Marcus

If someone is interested ...
Surely I am :)


--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

