Hello, thank you for your suggestion, and thanks also to Tina;
To answer your question, there is no TreeWidth entry in the slurm.conf
But it seems we figured out the issue... and I'm so sorry we did not
think about it: we already had a pool of 48 nodes on the master, but
their slurm.conf diverged from the one on the pool of nodes with the
"dancing" state; at the very least, their slurmd was not restarted;
And indeed, several people suggested that the slurmd daemons need to
talk to each other; that's really our fault; 100 nodes were aware of
all 148 nodes, while the other 48 nodes were only aware of themselves;
I suppose that created issues for the master;
So even though we also had other issues, like interfaces flip-flopping,
the diverged slurm.conf was probably the root cause.
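For reference, a quick way to catch a diverged slurm.conf across a
cluster; this is only a sketch, assuming clush is available, the config
lives at /etc/slurm/slurm.conf, and the nodes are named node[001-148]:

    # compare checksums of slurm.conf on every node; mismatches stand out
    clush -w node[001-148] md5sum /etc/slurm/slurm.conf | sort
    # once the files agree, restart slurmd everywhere so it is picked up
    clush -w node[001-148] systemctl restart slurmd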
Thank you all for your help, it is time to compute :)
Jeremy.
On 02/02/2022 16:27, Stephen Cousins wrote:
Hi Jeremy,
What is the value of TreeWidth in your slurm.conf? If there is no
entry then I recommend setting it to a value a bit larger than the
number of nodes you have in your cluster and then restarting slurmctld.
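For example, with 148 nodes, something like this in slurm.conf (the
exact value is up to you, as long as it exceeds the node count):

    # slurm.conf on the controller
    TreeWidth=150

followed by a restart of slurmctld, e.g. systemctl restart slurmctld.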
Best,
Steve
On Wed, Feb 2, 2022 at 12:59 AM Jeremy Fix
<jeremy....@centralesupelec.fr> wrote:
Hi,
A follow-up. I thought some of the nodes were ok, but that's not the
case; this morning, another pool of consecutive compute nodes is idle*
(why consecutive, by the way? they always fail consecutively). And now
some of the nodes which were drained came back to life in idle, and
have now switched back to idle* again.
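(For context, the '*' suffix marks nodes that slurmctld cannot reach;
a quick sketch to list them, with a hypothetical node name:

    sinfo -N -l | grep -F 'idle*'
    scontrol show node node042 | grep -E 'State|Reason'
)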
One thing I should mention is that the master is now handling a total
of 148 nodes; the new pool of 100 nodes is the one with the cycling
state. The previous 48 nodes that were already handled by this master
are ok.
I do not know if this should be considered a large system, but we
tried to have a look at settings such as the ARP cache [1] on the slurm
master. I'm not very familiar with that; as I understand it, it
enlarges the kernel's cache of IP-to-MAC mappings for the nodes. This
morning, the master has 125 lines in "arp -a" (before changing the
settings in sysctl, it was more like 20); do you think this setting is
also necessary on the compute nodes?
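(For reference, the knobs in question are the kernel's neighbour-table
garbage-collection thresholds; a sketch with illustrative values, e.g.
in /etc/sysctl.d/99-arp.conf:

    # raise the ARP/neighbour cache limits; values are examples only
    net.ipv4.neigh.default.gc_thresh1 = 8192
    net.ipv4.neigh.default.gc_thresh2 = 16384
    net.ipv4.neigh.default.gc_thresh3 = 32768

then apply with sysctl --system.)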
Best;
Jeremy.
[1]
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
--
________________________________________________________________
Steve Cousins Supercomputer Engineer/Administrator
Advanced Computing Group University of Maine System
244 Neville Hall (UMS Data Center) (207) 581-3574
Orono ME 04469 steve.cousins at maine.edu