On 8/3/22 8:37 am, Phil Chiu wrote:
> Therefore my problem is this - "Reboot all nodes, permitting N nodes to
> be rebooting simultaneously."
I think currently the only way to do that would be to have a script that
does:
* issue the `scontrol reboot ASAP nextstate=resume [...]` for the first 3 nodes (taking N = 3 as an example)
* wait for one of them to come back online
* issue the same `scontrol reboot` for one more node
* wait for another one to come back
* lather, rinse, repeat until every node has been rebooted.
This does assume you've got your nodes configured to come back cleanly
on a reboot with slurmd up and no manual intervention required (which is
what we do).
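
As a rough, untested-on-a-real-cluster sketch of that loop (not something we run in production): the function names, MAX_PARALLEL and POLL_SECONDS are made up for illustration, node names are passed as plain arguments, and it assumes that a node with a reboot pending or in progress shows the substring REBOOT in the State= field of `scontrol show node`. Please check that last assumption against your Slurm version before relying on it.

```shell
# Rolling reboot sketch: keep at most MAX_PARALLEL reboots outstanding.
MAX_PARALLEL=${MAX_PARALLEL:-3}    # hypothetical knob: the "N" in the question
POLL_SECONDS=${POLL_SECONDS:-30}   # hypothetical knob: delay between polls

# Ask Slurm to reboot one node as soon as it is free, then resume it.
request_reboot() {
    scontrol reboot ASAP nextstate=resume "$1"
}

# Succeeds while the node still carries a reboot flag.
# ASSUMPTION: the State= field contains "REBOOT" until the node is back.
reboot_pending() {
    scontrol show node "$1" | grep -Eq 'State=[^ ]*REBOOT'
}

# Print the subset of the given nodes that are still rebooting.
still_rebooting() {
    for n in "$@"; do
        if reboot_pending "$n"; then printf '%s ' "$n"; fi
    done
}

# Print how many arguments were passed (used to count in-flight nodes).
count_words() { echo "$#"; }

rolling_reboot() {
    in_flight=""
    for node in "$@"; do
        # Wait until fewer than MAX_PARALLEL reboots are outstanding.
        while [ "$(count_words $in_flight)" -ge "$MAX_PARALLEL" ]; do
            sleep "$POLL_SECONDS"
            in_flight=$(still_rebooting $in_flight)
        done
        if request_reboot "$node"; then
            in_flight="$in_flight $node"
        else
            echo "reboot request failed for $node" >&2
        fi
    done
    # Wait for the final batch to come back before returning.
    while [ -n "$in_flight" ]; do
        sleep "$POLL_SECONDS"
        in_flight=$(still_rebooting $in_flight)
    done
}
```

You would call it with an explicit node list, for example `rolling_reboot $(scontrol show hostnames 'node[01-20]')`, where the hostlist expression is only a placeholder. Note that a node which never comes back keeps its slot occupied forever, so a real script would probably want a timeout as well.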
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA