Hello,

Yes, it's a bug in the way the reboot RPCs are handled. A fix was recently committed, which we have yet to test, but 18.08.5 is meant to repair this (among other things).

Doug

On Tue, Jan 22, 2019 at 02:46 Martijn Kruiten <martijn.krui...@surfsara.nl> wrote:

> Hi,
>
> We encounter a strange issue on our system (Slurm 18.08.3), and I'm
> curious whether any of you recognizes this behavior. In the following
> example we try to reboot 32 nodes, of which 31 nodes are idle:
>
> root# scontrol reboot ASAP nextstate=resume reason=image r8n[1-32]
> root# sinfo -o "%100E %9u %19H %N"
>
> REASON                                                                                               USER      TIMESTAMP           NODELIST
> image                                                                                                root      2019-01-21T17:03:49 r8n32
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[1-3]
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:47 r8n[4-10]
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[11-15]
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:48 r8n[16-23]
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[24-29]
> image : reboot issued : reboot issued : reboot issued : reboot issued : reboot issued : reboot issue root      2019-01-21T17:03:49 r8n[30-31]
>
> For as long as the allocated node (r8n32) has not been rebooted, the
> "reboot issued" message keeps appending to the reason for all other nodes,
> and the ResumeTimeout is ignored. Even worse: the other nodes get stuck in
> an endless reboot loop. It seems like they keep getting the instruction to
> reboot. As soon as I cancel the reboot for the allocated node, the reboot
> loop stops for all other nodes.
>
> This also happens if we do the reboot command in a loop:
>
> root# for n in {1..32}; do scontrol reboot ASAP nextstate=resume reason=image r8n$n; done
>
> So it seems that Slurm somehow groups all nodes that need to be rebooted
> together, and issues reboot commands to them until the last one of them is
> ready to reboot. This happens regardless of whether the scontrol command
> has been issued for all nodes at once or independently.
>
> I should add that the command works fine if we need to reboot just one
> node, or for a couple of nodes that were already idle to begin with. The
> RebootProgram is /sbin/reboot, so nothing out of the ordinary.
>
> Best regards,
>
> Martijn Kruiten
>
> --
>
> | System Programmer | SURFsara | Science Park 140 | 1098 XG Amsterdam |
> | T +31 6 20043417 | martijn.krui...@surfsara.nl <bas.vandervl...@surfsara.nl> | www.surfsara.nl |
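
For anyone who wants to check whether their own nodes show the symptom described in the quoted message, the following is a minimal sketch rather than a tested tool. It reads lines in the same layout as the `sinfo -o "%100E %9u %19H %N"` output above and prints the node list plus the number of times "reboot issued" appears in the reason. The function name `flag_reboot_loops` and the threshold (more than one occurrence counts as suspicious) are assumptions made for illustration, not anything Slurm defines. Because `%100E` cuts the reason off at 100 characters, the count is only a lower bound.

```shell
#!/bin/sh
# flag_reboot_loops (hypothetical helper): read sinfo-style lines on stdin,
# where the reason comes first and the node list is the last field, and print
# "<nodelist> <count>" for every line whose reason repeats "reboot issued".
flag_reboot_loops() {
    awk '{
        line = $0
        # gsub returns the number of substitutions, i.e. the match count.
        n = gsub(/reboot issued/, "", line)
        # Assumption: a single "reboot issued" is normal, repeats are not.
        if (n > 1) print $NF, n
    }'
}
```

A possible invocation, assuming `sinfo` is on the PATH (`-h` suppresses the header line):

    sinfo -h -o "%100E %9u %19H %N" | flag_reboot_loops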