Hi all,

(configuration and scripts are below the text)

I have configured SLURM to power down idle nodes, but the setup is
probably misconfigured. I am aiming for a configuration where idle
nodes are powered down after a certain period (say 10 min).

As you can see from the configuration below I have SLURM call either
"node_poweroff.slurm" or "node_poweron.slurm" which are wrapper scripts
that handle the conversion of SLURM's nodelist syntax and call
"node_poweroff" or "node_poweron" for each node.

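To illustrate the conversion the wrappers rely on: "scontrol show
hostnames" turns a compact nodelist such as n[510301,510601,511901] into
one hostname per line. The function below is only a simplified stand-in
for that step (the name expand_nodelist is hypothetical and not part of
SLURM). It handles a plain hostname or a single bracketed comma list and
rejects ranges, so it is a sketch for testing the wrapper logic without
scontrol, not a replacement for it.

```shell
# Hypothetical stand-in for "scontrol show hostnames".
# Supports "name" and "prefix[a,b,c]" only; ranges are rejected.
expand_nodelist() {
  list=$1
  case $list in
    *"["*) ;;                                  # bracketed form, handled below
    *) printf '%s\n' "$list"; return 0 ;;      # single plain hostname
  esac
  prefix=${list%%\[*}     # text before "[", e.g. "n"
  body=${list#*\[}        # text after "[", e.g. "510301,510601]"
  body=${body%\]}         # drop the closing "]"
  case $body in
    *-*|*"["*|*"]"*) echo "unsupported nodelist: $list" >&2; return 2 ;;
  esac
  # One hostname per line, prefix re-attached to every list item.
  printf '%s\n' "$body" | tr ',' '\n' | while IFS= read -r item; do
    printf '%s%s\n' "$prefix" "$item"
  done
}
```

Calling expand_nodelist 'n[511101,512501]' prints n511101 and n512501 on
separate lines, which is the shape the for loops in the wrappers expect.
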
"node_power{off,on}" log their actions into /var/log/slurm/powermgmt.log
so I can follow and in the future analyze which nodes were turned off
and on.

The current situation is that although I see 36 out of 54 nodes in an
IDLE+POWER state, all nodes are powered on and accessible via SSH.

Output from "grep -i power /var/log/slurm/slurmctld.log | tail"

[2014-08-28T12:01:24.975] Power save mode: 30 nodes
[2014-08-28T12:11:44.080] Power save mode: 30 nodes
[2014-08-28T12:22:44.194] Power save mode: 30 nodes
[2014-08-28T12:33:44.306] Power save mode: 30 nodes
[2014-08-28T12:44:01.425] Power save mode: 30 nodes
[2014-08-28T12:51:44.514] power_save: suspending nodes n[510301,510601,511901]
[2014-08-28T12:54:26.547] Power save mode: 33 nodes
[2014-08-28T12:54:26.547] power_save: suspending nodes n[511101,512501]
[2014-08-28T12:57:08.581] power_save: suspending nodes n510901
[2014-08-28T13:05:10.666] Power save mode: 36 nodes

Output from "tail /var/log/slurm/powermgmt.log"

2014-08-27 16:39:36 power on   n512501
2014-08-27 16:51:17 power on   n512601
2014-08-27 17:59:38 power on   n512601
2014-08-28 09:05:54 power on   n511101
2014-08-28 09:06:05 power on   n511201
2014-08-28 09:06:11 power on   n512001
2014-08-28 09:06:19 power on   n512201
2014-08-28 10:41:51 power on   n510501
2014-08-28 10:41:51 power on   n510701
2014-08-28 11:31:41 power on   n511101

grep does not find "down" in /var/log/slurm/powermgmt.log which it
should if "node_poweroff" has been executed.
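
A quick way to tally the logged actions, as a sketch: the function name
count_power_actions is hypothetical, and it assumes every line has the
"<date> <time> power <action> <node>" layout shown above.

```shell
# Count "power on" and "power down" entries in a powermgmt-style log.
# Assumes fields: date, time, the word "power", the action, the node.
count_power_actions() {
  awk '$3 == "power" { n[$4]++ }
       END { printf "on=%d down=%d\n", n["on"], n["down"] }' "$1"
}
```

It would be called as count_power_actions /var/log/slurm/powermgmt.log.
A "down" count of zero while slurmctld reports suspended nodes suggests
that "node_poweroff" never got as far as its echo line.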

My impression is that something (misconfiguration? bad sudo
configuration? some other permissions problem?) doesn't allow SLURM to
execute one of the mentioned scripts.
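
One way to narrow this down, as a sketch only: wrap each command in a
helper that records who ran it and how it exited, so a failure inside
slurmctld's environment leaves a trace. The name run_logged and the
DEBUG_LOG path are assumptions; the log must be writable by the user
that slurmctld runs as.

```shell
# Hypothetical helper: run a command and append its output, the calling
# user and the exit status to a debug log.
# DEBUG_LOG is an assumed location, not something SLURM defines.
DEBUG_LOG=${DEBUG_LOG:-/tmp/powermgmt-debug.log}

run_logged() {
  "$@" >>"$DEBUG_LOG" 2>&1          # capture stdout and stderr of the command
  rc=$?
  printf '%s user=%s rc=%d cmd=%s\n' \
    "$(date +'%F %T')" "$(id -un)" "$rc" "$*" >>"$DEBUG_LOG"
  return "$rc"                      # pass the original exit status through
}
```

Inside "node_poweroff.slurm" the loop body could then read
run_logged sudo -n /opt/system/slurm/etc/node_poweroff "${NODE}". The -n
option makes sudo fail instead of prompting, so a sudoers problem would
show up as a non-zero rc together with sudo's error message.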

Can someone check my configuration and give some advice on how to debug
this issue further?


Thank you,

        Uwe


### slurm.conf excerpt ###

# POWER SAVE SUPPORT FOR IDLE NODES (optional)
SuspendTime=600
SuspendRate=30
ResumeRate=10
SuspendProgram=/opt/system/slurm/etc/node_poweroff.slurm
ResumeProgram=/opt/system/slurm/etc/node_poweron.slurm
SuspendTimeout=120
ResumeTimeout=300
#SuspendExcNodes=n51[03,04,29,30][01],n52[04,05][01]
#SuspendExcParts=
BatchStartTimeout=60

##########################

### /opt/system/slurm/etc/node_poweroff.slurm ###

#!/bin/bash
set -o nounset

NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames $1)

for NODE in ${NODES}; do
  sudo /opt/system/slurm/etc/node_poweroff ${NODE}
done

exit 0

#################################################

### /opt/system/slurm/etc/node_poweron.slurm ###

#!/bin/bash
set -o nounset

NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames $1)

for NODE in ${NODES}; do
  /opt/system/slurm/etc/node_poweron ${NODE}
done

#################################################

### /opt/system/slurm/etc/node_poweroff ###

#!/bin/bash
set -o nounset

NODE=$1

echo "$(date +'%F %T') power down ${NODE}" >> /var/log/slurm/powermgmt.log

ssh ${NODE} "/etc/init.d/lustre_client stop"
ssh ${NODE} "umount /localscratch /nfs/*"
ssh ${NODE} "service slurm stop"
ssh ${NODE} "service munge stop"
ssh ${NODE} "poweroff"

sleep 10

ping -c1 ${NODE} >/dev/null 2>&1
[ $? -eq 0 ] && /usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power off

exit 0

#############################################

### /opt/system/slurm/etc/node_poweron ###

#!/bin/bash
set -o nounset

NODE=${1}

echo "$(date +'%F %T') power on   ${NODE}" >> /var/log/slurm/powermgmt.log

/usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power on

exit 0


##########################################

### /etc/sudoers excerpt ###

slurm           ALL=NOPASSWD: /opt/system/slurm/etc/node_poweron
slurm           ALL=NOPASSWD: /opt/system/slurm/etc/node_poweroff

############################
