Hello. I need help troubleshooting our Slurm cluster. I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud infrastructure (Jetstream) using the elastic computing mechanism (https://slurm.schedmd.com/elastic_computing.html). Our cluster works for the most part, but for some reason a few of our nodes constantly go into the "down" state.
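For reference, the node state and the reason slurmctld records when it marks a node down can be checked with the standard sinfo/scontrol commands (nothing custom on our side), e.g.:

    sinfo -R                                # down/drained nodes with the recorded reason
    scontrol show node slurm9-compute6      # full state of a single node

Currently the partition looks like this: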
PARTITION  AVAIL  TIMELIMIT   JOB_SIZE    ROOT  OVERSUBS  GROUPS  NODES  STATE  NODELIST
cloud*     up     2-00:00:00  1-infinite  no    YES:4     all     10     idle~  slurm9-compute[1-5,10,12-15]
cloud*     up     2-00:00:00  1-infinite  no    YES:4     all     5      down   slurm9-compute[6-9,11]

The only log entries I see in the slurmctld log are these:

[2021-07-30T15:10:55.889] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING to=RESUME
[2021-07-30T15:21:37.339] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING* to=RESUME
[2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to: completing
[2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to DOWN
[2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to IDLE
..
[2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN

With elastic computing, any unused nodes are automatically removed (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So nodes are *expected* to stop responding once they are removed, but they should not be marked DOWN; they should simply be set back to "idle".

To work around this issue, I am running the following cron job:

0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume

This "works" somewhat, but our nodes go into the DOWN state so often that resetting them on a schedule like this is not enough.

Here is the full content of our slurm.conf:

root@slurm9:~# cat /etc/slurm-llnl/slurm.conf
ClusterName=slurm9
ControlMachine=slurm9
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
ProctrackType=proctrack/pgid
ReturnToService=1
Prolog=/usr/local/sbin/slurm_prolog.sh
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
#make slurm a little more tolerant here
MessageTimeout=30
TCPTimeout=15
BatchStartTimeout=20
GetEnvTimeout=20
InactiveLimit=0
MinJobAge=604800
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#FastSchedule=0
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
#CLOUD CONFIGURATION
PrivateData=cloud
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeRate=1        #number of nodes per minute that can be created; 0 means no limit
ResumeTimeout=900   #max time in seconds between ResumeProgram running and when the node is ready for use
SuspendRate=1       #number of nodes per minute that can be suspended/destroyed
SuspendTime=600     #time in seconds before an idle node is suspended
SuspendTimeout=300  #time between running SuspendProgram and the node being completely down
TreeWidth=30
NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES MaxTime=48:00:00 State=UP Shared=YES

I appreciate your assistance!

Soichi Hayashi
Indiana University
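P.S. I did not include slurm_resume.sh / slurm_suspend.sh above. For context, a minimal resume/suspend pair for an OpenStack-based cloud such as Jetstream can look roughly like the sketch below; this is only a simplified illustration (the image/flavor names are placeholders and the openstack CLI calls are my shorthand, not the exact contents of our scripts):

    #!/bin/bash
    # slurm_resume.sh (sketch): Slurm passes the node list, e.g. "slurm9-compute[6-9,11]", as $1
    for node in $(scontrol show hostnames "$1"); do
        # boot a fresh instance for each node name (placeholder image/flavor)
        openstack server create --image compute-image --flavor m1.xlarge --wait "$node"
    done

    #!/bin/bash
    # slurm_suspend.sh (sketch): delete the instances backing the idle nodes
    for node in $(scontrol show hostnames "$1"); do
        openstack server delete "$node"
    done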