Hi,

A follow-up. I thought some of the nodes were OK, but that's not the case: this morning, another pool of consecutive compute nodes went idle* (why consecutive, by the way? They always fail consecutively). And some of the nodes which were drained came back to life as idle, and have now switched back to idle*.

One thing I should mention is that the master now handles a total of 148 nodes: the new pool of 100 nodes that shows the cycling state, plus the previous 48 nodes already handled by this master, which are fine.

I do not know if this should be considered a large system, but we tried having a look at settings such as the ARP cache [1] on the Slurm master. I'm not very familiar with that; as I understand it, it enlarges the kernel's cache of IP-to-MAC (neighbor) entries for the nodes. This morning, the master has 125 lines in "arp -a" (before changing the settings in sysctl, it was around 20). Do you think these settings are also necessary on the compute nodes?
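For reference, the knobs in question are the kernel's neighbor-table garbage-collection thresholds. A sketch of the kind of sysctl fragment the linked wiki page describes (the exact values below are illustrative assumptions, not taken from the wiki; they just need to sit comfortably above your node count):

```
# /etc/sysctl.d/99-arp-cache.conf -- example values, size to your network
net.ipv4.neigh.default.gc_thresh1 = 256    # below this many entries, no GC runs
net.ipv4.neigh.default.gc_thresh2 = 1024   # soft limit; GC becomes more aggressive
net.ipv4.neigh.default.gc_thresh3 = 2048   # hard limit on neighbor-cache entries
```

Applied with "sysctl --system" (or a reboot). With only ~148 nodes plus switches you are well under the default hard limit, so whether this is the actual cause is not obvious to me.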

Best,

Jeremy.


[1] https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks



