Hi Jeremy,
I haven't got anything very intelligent to contribute towards solving your problem.
However, what I can tell you is that we run our production cluster with
one SLURM master running on a virtual machine handling just over 300
nodes. We have never seen the sort of problem you have other than when
there was a problem contacting the nodes.
The VM running slurmctld doesn't get any tuning; it's a stock CentOS 8
server. No increase in any caching (ARP or otherwise) on the master. I just
checked, and I don't even think I'm doing anything special about process
or memory limits for the user the SLURM processes run as.
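(For comparison - if by ARP tuning you mean the neighbour table gc_thresh
sysctls the page you linked talks about, we're on whatever the kernel
defaults are. Something like the below shows the current values; purely a
comparison point, not a recommendation.)

   # print the current neighbour/ARP table thresholds (kernel defaults in our case)
   sysctl net.ipv4.neigh.default.gc_thresh1 \
          net.ipv4.neigh.default.gc_thresh2 \
          net.ipv4.neigh.default.gc_thresh3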
I have - from time to time - had the controller go unresponsive for a
moment, but that's usually to do with lots of prologs/epilogs happening
at the same time, and it does not cause node status to flap like that.
So unless you have indications of very high load or memory pressure on
the master, I wouldn't suspect the master not coping as the cause of this.
(I don't do host files, I use DNS. But that really shouldn't make a
difference.)
A lot of people have said name resolution - and yes, that could be it - but
I'm actually also wondering if you might have a network problem
somewhere? Ethernet, I mean? Congestion, or corrupted packets?
Multipathing or path failover or spanning tree going wrong or flapping?
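If you want a quick way to rule the obvious bits out, something along
these lines on the master and on one of the affected nodes might be worth
running (the node name and interface below are placeholders, obviously):

   # does name resolution give what slurm.conf expects?
   getent hosts node101
   # any errors/drops on the interface the slurmctld traffic goes over?
   ip -s link show eth0
   # is the node reachable without packet loss?
   ping -c 20 node101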
Tina
On 02/02/2022 05:56, Jeremy Fix wrote:
Hi,
A follow-up. I thought some of the nodes were ok, but that's not the case.
This morning, another pool of consecutive compute nodes (why consecutive,
by the way? they are always failing consecutively) are idle*. And now
some of the nodes which were drained came back to life as idle and have
now switched to idle* again.
One thing I should mention is that the master is now handling a total of
148 nodes. It's the new pool of 100 nodes which have a cycling state;
the previous 48 nodes that were already handled by this master are ok.
I do not know if this should be considered a large system, but we tried
to have a look at settings such as the ARP cache [1] on the slurm
master. I'm not very familiar with that; it seems to me it enlarges the
kernel's table of node IP/MAC mappings (the ARP cache). This morning, the
master has 125 lines in "arp -a" (before changing the settings in sysctl,
it was like, 20 or so). Do you think these settings are also necessary on
the compute nodes?
Best;
Jeremy.
[1]
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk