Hi all,

(configuration and scripts below the text)

I have configured SLURM to power down idle nodes, but it is probably misconfigured. I aim for a configuration where idle nodes are powered down after a certain period (say 10 min). As you can see from the configuration below, I have SLURM call either "node_poweroff.slurm" or "node_poweron.slurm", which are wrapper scripts that handle the conversion of SLURM's nodelist syntax and call "node_poweroff" or "node_poweron" for each node. "node_power{off,on}" log their actions to /var/log/slurm/powermgmt.log so I can follow, and in the future analyze, which nodes were turned off and on.

The current situation is that although I see 36 out of 54 nodes in an IDLE+POWER state, all nodes are powered on and accessible via SSH.

Output from "grep -i power /var/log/slurm/slurmctld.log | tail"
[2014-08-28T12:01:24.975] Power save mode: 30 nodes
[2014-08-28T12:11:44.080] Power save mode: 30 nodes
[2014-08-28T12:22:44.194] Power save mode: 30 nodes
[2014-08-28T12:33:44.306] Power save mode: 30 nodes
[2014-08-28T12:44:01.425] Power save mode: 30 nodes
[2014-08-28T12:51:44.514] power_save: suspending nodes n[510301,510601,511901]
[2014-08-28T12:54:26.547] Power save mode: 33 nodes
[2014-08-28T12:54:26.547] power_save: suspending nodes n[511101,512501]
[2014-08-28T12:57:08.581] power_save: suspending nodes n510901
[2014-08-28T13:05:10.666] Power save mode: 36 nodes

Output from "tail /var/log/slurm/powermgmt.log"
2014-08-27 16:39:36 power on n512501
2014-08-27 16:51:17 power on n512601
2014-08-27 17:59:38 power on n512601
2014-08-28 09:05:54 power on n511101
2014-08-28 09:06:05 power on n511201
2014-08-28 09:06:11 power on n512001
2014-08-28 09:06:19 power on n512201
2014-08-28 10:41:51 power on n510501
2014-08-28 10:41:51 power on n510701
2014-08-28 11:31:41 power on n511101

grep does not find "down" in /var/log/slurm/powermgmt.log, which it should if "node_poweroff" had been executed. My impression is that something (misconfiguration? bad sudo configuration? other permissions issue?)
doesn't allow SLURM to execute one of the mentioned scripts. Can someone check my configuration and give some advice on how to debug this issue further?

Thank you,
Uwe

### slurm.conf excerpt ###
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
SuspendTime=600
SuspendRate=30
ResumeRate=10
SuspendProgram=/opt/system/slurm/etc/node_poweroff.slurm
ResumeProgram=/opt/system/slurm/etc/node_poweron.slurm
SuspendTimeout=120
ResumeTimeout=300
#SuspendExcNodes=n51[03,04,29,30][01],n52[04,05][01]
#SuspendExcParts=
BatchStartTimeout=60
##########################

### /opt/system/slurm/etc/node_poweroff.slurm ###
#!/bin/bash
set -o nounset
NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames $1)
for NODE in ${NODES}; do
    sudo /opt/system/slurm/etc/node_poweroff ${NODE}
done
exit 0
#################################################

### /opt/system/slurm/etc/node_poweron.slurm ###
#!/bin/bash
set -o nounset
NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames $1)
for NODE in ${NODES}; do
    /opt/system/slurm/etc/node_poweron ${NODE}
done
#################################################

### /opt/system/slurm/etc/node_poweroff ###
#!/bin/bash
set -o nounset
NODE=$1
echo "$(date +'%F %T') power down ${NODE}" >> /var/log/slurm/powermgmt.log
ssh ${NODE} "/etc/init.d/lustre_client stop"
ssh ${NODE} "umount /localscratch /nfs/*"
ssh ${NODE} "service slurm stop"
ssh ${NODE} "service munge stop"
ssh ${NODE} "poweroff"
sleep 10
ping -c1 ${NODE} >/dev/null 2>&1
[ $? -eq 0 ] && /usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power off
exit 0
#############################################

### /opt/system/slurm/etc/node_poweron ###
#!/bin/bash
set -o nounset
NODE=${1}
echo "$(date +'%F %T') power on ${NODE}" >> /var/log/slurm/powermgmt.log
/usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power on
exit 0
##########################################

### /etc/sudoers excerpt ###
slurm ALL=NOPASSWD: /opt/system/slurm/etc/node_poweron
slurm ALL=NOPASSWD: /opt/system/slurm/etc/node_poweroff
############################
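
One way to narrow this down could be a debug variant of the suspend wrapper that records every invocation, the user it runs as, and the exit status and error output of each command. The following is only a sketch under assumptions, not a confirmed fix: LOG, log_run and suspend_nodes are hypothetical names, the /tmp log path is an arbitrary choice, and "sudo -n" is used so that sudo fails with a message instead of waiting for a password.

```shell
#!/bin/bash
# Debugging sketch only. LOG, log_run and suspend_nodes are hypothetical
# names, and the /tmp log path is an assumption (it must be writable by
# the user slurmctld runs the script as).
set -o nounset

LOG="${LOG:-/tmp/powermgmt-debug.log}"

# Run a command, capture its stdout/stderr in ${LOG}, then record who ran
# it and which exit status it returned.
log_run() {
    local rc
    "$@" >>"${LOG}" 2>&1
    rc=$?
    echo "$(date +'%F %T') user=$(id -un) rc=${rc} cmd=$*" >>"${LOG}"
    return "${rc}"
}

# Same loop as node_poweroff.slurm, but with every step logged.
# "sudo -n" makes sudo exit with an error rather than prompt.
suspend_nodes() {
    local nodes node
    echo "$(date +'%F %T') user=$(id -un) called with: $*" >>"${LOG}"
    nodes=$(/opt/system/slurm/default/bin/scontrol show hostnames "$1" 2>>"${LOG}")
    for node in ${nodes}; do
        log_run sudo -n /opt/system/slurm/etc/node_poweroff "${node}"
    done
}

# Only act when a nodelist is passed, as SLURM does for SuspendProgram.
if [ "$#" -gt 0 ]; then
    suspend_nodes "$1"
fi
```

If the log stays empty after slurmctld reports "suspending nodes", the wrapper itself is presumably never started (path or execute permission); if it contains "called with" lines followed by non-zero rc values, the sudo or ssh error text captured next to them should point at the failing step.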