On Thursday, 22 March 2018 2:01:02 AM AEDT Michael Jennings wrote:

> As you can see from
> https://github.com/mej/nhc/blob/master/helpers/node-mark-offline#L55
> starting at line #61, NHC uses "sinfo -o '%t %E' -hn $HOSTNAME" to get
> the current node's status.

At ${JOB-1} our health check scripts were decoupled from Slurm and run
from cron. They wrote their status into a file in /dev/shm on successful
completion so Slurm could just poll that - the idea being to try and
reduce the chance the check would hang due to a system issue and stop
slurmd responding.

All the best,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
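
For illustration, here is a minimal sketch of the cron-plus-status-file pattern described above, written in Python. It is not the original scripts, which are not shown here. The file name, the "OK" convention, the staleness window and both function names are assumptions made up for the example. The cron side publishes its result only after the checks finish, and the polling side does nothing more than a stat and a short read, so a hung check cannot block whatever is polling.

```python
import os
import tempfile
import time

# Hypothetical location and freshness window; neither comes from the original post.
STATUS_FILE = "/dev/shm/node_health_status"
MAX_AGE_SECONDS = 600


def write_status(path, text):
    """Cron side: publish the result atomically once the checks have finished."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(text + "\n")
    # A rename within one filesystem is atomic, so a reader never sees a partial file.
    os.replace(tmp_path, path)


def read_status(path, max_age=MAX_AGE_SECONDS, now=None):
    """Polling side: a cheap stat and read; the checks themselves never run here."""
    now = time.time() if now is None else now
    try:
        age = now - os.stat(path).st_mtime
        with open(path) as handle:
            text = handle.read().strip()
    except OSError:
        return False, "no status file"
    if age > max_age:
        # The cron job has not reported recently, e.g. because a check is hung.
        return False, "status file is stale"
    return text == "OK", text
```

Under these assumptions the cron job would call write_status(STATUS_FILE, "OK") as its last step, and the polling script would call read_status(STATUS_FILE) and treat a False result as a reason to flag the node. Treating a stale file as a failure is a design choice added here, so that a check that hangs in cron still shows up on the polling side.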