Hello. I need help troubleshooting our Slurm cluster.

I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud
infrastructure (Jetstream) using the elastic computing mechanism
(https://slurm.schedmd.com/elastic_computing.html). Our cluster works for
the most part, but for some reason a few of our nodes keep going into the
"down" state.

PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES  STATE NODELIST
cloud*       up 2-00:00:00 1-infinite   no    YES:4        all     10  idle~ slurm9-compute[1-5,10,12-15]
cloud*       up 2-00:00:00 1-infinite   no    YES:4        all      5   down slurm9-compute[6-9,11]
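
For what it's worth, the reason Slurm records when it marks a node down can
be checked like this (a generic check, using one of the affected nodes above):

sinfo -R
scontrol show node slurm9-compute6 | grep -iE 'state|reason'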

The only relevant entries I see in the slurm log are these:

[2021-07-30T15:10:55.889] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING to=RESUME
[2021-07-30T15:21:37.339] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING* to=RESUME
[2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to: completing
[2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to DOWN
[2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to IDLE
..
[2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN

With elastic computing, unused nodes are automatically removed
(by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So nodes are
*expected* to stop responding once they are removed, but they should not be
marked DOWN; they should simply return to the "idle" state.
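
For context, the SuspendProgram contract is simple: slurmctld invokes the
script with a hostlist of the nodes it wants to release, and the script
tears down the matching instances. Our real slurm_suspend.sh is not shown
here; this is only a minimal sketch of the idea, with the openstack call
standing in for however the instances actually get deleted on Jetstream:

#!/bin/bash
# $1 is a hostlist expression from slurmctld, e.g. "slurm9-compute[6-9,11]"
for host in $(scontrol show hostnames "$1"); do
    # placeholder for the real instance teardown on Jetstream
    openstack server delete "$host"
done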

To work around this issue, I am running the following cron job:

0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume

This "works" to a degree, but our nodes go into the "DOWN" state so often
that even running it every hour would not be enough.
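
A slightly more targeted variant (just a sketch, assuming the "cloud"
partition from the slurm.conf below) would be to resume only the nodes that
are actually marked down, so it could safely run every few minutes:

#!/bin/bash
# resume only the cloud nodes currently in the "down" state
down=$(sinfo -h -p cloud -t down -o '%N')
if [ -n "$down" ]; then
    scontrol update nodename="$down" state=resume
fi

But this still only papers over whatever the underlying problem is.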

Here is the full content of our slurm.conf

root@slurm9:~# cat /etc/slurm-llnl/slurm.conf
ClusterName=slurm9
ControlMachine=slurm9

SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
ProctrackType=proctrack/pgid
ReturnToService=1
Prolog=/usr/local/sbin/slurm_prolog.sh

#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
#make slurm a little more tolerant here
MessageTimeout=30
TCPTimeout=15
BatchStartTimeout=20
GetEnvTimeout=20
InactiveLimit=0
MinJobAge=604800
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#FastSchedule=0

# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log

#CLOUD CONFIGURATION
PrivateData=cloud
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeRate=1 #number of nodes per minute that can be created; 0 means no limit
ResumeTimeout=900 #max time in seconds between ResumeProgram running and when the node is ready for use
SuspendRate=1 #number of nodes per minute that can be suspended/destroyed
SuspendTime=600 #time in seconds before an idle node is suspended
SuspendTimeout=300 #time between running SuspendProgram and the node being completely down
TreeWidth=30

NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES MaxTime=48:00:00 State=UP Shared=YES

I appreciate your assistance!

Soichi Hayashi
Indiana University
