Hello, thank you for your suggestion, and thanks also to Tina;
To answer your question, there is no TreeWidth entry in the slurm.conf
But it seems we figured out the issue... and I'm so sorry we did not
think about it: we already had a pool of 48 nodes on the master, but
their slurm.conf diverged from the one on the pool of nodes with the
"dancing" state; at the very least, their slurmd was not restarted;
And indeed, several people suggested that the slurmd daemons need to
talk to each other; that's really our fault; 100 nodes were aware of
all 148 nodes, while the other 48 nodes were only aware of themselves;
I suppose that created issues for the master;
So even though we also had other issues, like interfaces flip-flopping,
the diverged slurm.conf was probably the root cause.
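For reference, a quick way to catch a diverged slurm.conf across a
cluster; this is only a sketch, assuming clush is available, the config
lives at /etc/slurm/slurm.conf, and the nodes are named node[001-148]:

    # compare checksums of slurm.conf on every node; mismatches stand out
    clush -w node[001-148] md5sum /etc/slurm/slurm.conf | sort
    # once the files agree, restart slurmd everywhere so it is picked up
    clush -w node[001-148] systemctl restart slurmd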
Thank you all for your help, it is time to compute :)
Jeremy.
On 02/02/2022 16:27, Stephen Cousins wrote:
Hi Jeremy,
What is the value of TreeWidth in your slurm.conf? If there is no
entry then I recommend setting it to a value a bit larger than the
number of nodes you have in your cluster and then restarting slurmctld.
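For example, with 148 nodes, something like this in slurm.conf (the
exact value is up to you, as long as it exceeds the node count):

    # slurm.conf on the controller
    TreeWidth=150

followed by a restart of slurmctld, e.g. systemctl restart slurmctld.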
Best,
Steve
On Wed, Feb 2, 2022 at 12:59 AM Jeremy Fix
<jeremy....@centralesupelec.fr> wrote:
Hi,
A follow-up. I thought some of the nodes were ok, but that's not the
case; this morning, another pool of consecutive compute nodes is idle*
(why consecutive, by the way? they always fail consecutively). And now
some of the nodes which were drained came back to life in idle, and
have now switched back to idle* again.
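(For context, the '*' suffix marks nodes that slurmctld cannot reach;
a quick sketch to list them, with a hypothetical node name:

    sinfo -N -l | grep -F 'idle*'
    scontrol show node node042 | grep -E 'State|Reason'
)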
One thing I should mention is that the master is now handling a total
of 148 nodes; the new pool of 100 nodes is the one with the cycling
state. The previous 48 nodes that were already handled by this master
are ok.
I do not know if this should be considered a large system, but we
tried to have a look at settings such as the ARP cache [1] on the slurm
master. I'm not very familiar with that; as I understand it, it
enlarges the kernel's cache of IP-to-MAC mappings for the nodes. This
morning, the master has 125 lines in "arp -a" (before changing the
settings in sysctl, it was more like 20); do you think this setting is
also necessary on the compute nodes?
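(For reference, the knobs in question are the kernel's neighbour-table
garbage-collection thresholds; a sketch with illustrative values, e.g.
in /etc/sysctl.d/99-arp.conf:

    # raise the ARP/neighbour cache limits; values are examples only
    net.ipv4.neigh.default.gc_thresh1 = 8192
    net.ipv4.neigh.default.gc_thresh2 = 16384
    net.ipv4.neigh.default.gc_thresh3 = 32768

then apply with sysctl --system.)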
Best;
Jeremy.
[1]
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
--
________________________________________________________________
Steve Cousins Supercomputer Engineer/Administrator
Advanced Computing Group University of Maine System
244 Neville Hall (UMS Data Center) (207) 581-3574
Orono ME 04469 steve.cousins at maine.edu