Hola, just looking for some clarification on how job suspension works.
I have just had some success with it and want to confirm that it is as wonderful as it seemed.

Scenario: we were about to switch from dev to prod, and my boss asked me to reboot the whole cluster to show stakeholders that it would come back up smoothly. Since we had a long (48-hour) job running, I suspended that job and pushed the reboot command to all worker nodes, the head node, and the submission node via Ansible. The head node, being a VM, was back up almost immediately; the other nodes are all hardware and take a bit longer.

I was pleased to see the suspended job still there. After a minute I resumed the job, because our SlurmctldTimeout is too short (120 seconds) and I didn't want to hit the timeout waiting for the nodes to come back up. Interestingly, the particular node the job had been running on hadn't come up smoothly, and slurmctld seems to have simply moved the job onto a node that had come up fine. I had to jump into the hardware interface to jump-start the errant node, so it was down for about 5 minutes.

Anyway, this is a wonderful result, thank you. The questions that were then asked of me:

- Is the switching of nodes expected behaviour?
- How does Slurm put jobs into suspended mode, given that some may have large amounts of data in memory?

Cheers
L.

------
The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper
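For reference, the sequence described above can be sketched roughly as follows. The job ID and the Ansible host group name are placeholders, not the real values from our cluster:

```shell
# Suspend the long-running job before the reboot
# (job ID 12345 is a placeholder).
scontrol suspend 12345

# Push the reboot to all nodes via an ad-hoc Ansible run
# ("cluster" is a placeholder host group covering workers,
# head node, and submission node).
ansible cluster -b -m reboot

# Resume the job shortly afterwards, rather than waiting for
# every node to finish booting.
scontrol resume 12345

# Check which node(s) the job is running on now.
squeue -j 12345 -o "%i %T %N"
```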
