Hola, just looking for some clarification on how job suspension works.
I have just had some success with it and want to confirm that it is as wonderful as it seemed.

Scenario: we were about to switch from dev to prod, and my boss asked me to reboot the whole cluster to show stakeholders that it would come back up smoothly. Since we had a long (48-hour) job running, I suspended that job and pushed the reboot command to all worker nodes, the head node, and the submission node via Ansible. The head node, being a VM, was back up almost immediately; the other nodes are all hardware and take a bit longer.

I was pleased to see the suspended job still there. After a minute I resumed the job, because our SlurmctldTimeout is too short (120 seconds) and I didn't want to hit the timeout waiting for the nodes to come back up. Interestingly, the particular node the job had been running on hadn't come up smoothly, and slurmctld seems to have simply moved the job onto a node that had come up fine. I had to jump into the hardware interface to jump-start the errant node, so it was down for about 5 minutes.

Anyway, this is a wonderful result, thank you. The questions that were then asked of me:

- Is the switching of nodes expected behaviour?
- How does Slurm put jobs into suspended mode, given that some may have large amounts of data in memory?

Cheers
L.

------
The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper
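For reference, the sequence described above can be sketched roughly as follows. The job ID and the Ansible host group name are placeholders, not the real values from our cluster:

```shell
# Suspend the long-running job before the reboot
# (job ID 12345 is a placeholder).
scontrol suspend 12345

# Push the reboot to all nodes via an ad-hoc Ansible run
# ("cluster" is a placeholder host group covering workers,
# head node, and submission node).
ansible cluster -b -m reboot

# Resume the job shortly afterwards, rather than waiting for
# every node to finish booting.
scontrol resume 12345

# Check which node(s) the job is running on now.
squeue -j 12345 -o "%i %T %N"
```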
