I'm currently thinking of something along those lines - setting up some kind of TRES resource that limits how many nodes are rebooted at any one time.
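
Very much a sketch, but something like a cluster-wide licence in slurm.conf plus one licence per reboot job would cap the concurrency (the licence name, the count of 4, and the job script name are just placeholders):

  # slurm.conf: define 4 cluster-wide "reboot" licences
  Licenses=reboot:4

  # each reboot job requests one licence, so at most 4 can run at once
  sbatch --licenses=reboot:1 --nodelist=node001 reboot_node.sh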

I usually do this sort of thing more or less manually: I generate a list of sbatch commands for the reboot job (one job per node, specifying the node name), ordered to my liking (making sure I always have GPUs of type X available, that sort of thing), and then submit that in batches, waiting for one batch to finish before the next goes in.
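
Roughly something like the following, say, assuming a plain text file with the node names in the order you want, a batch size of 4, and a reboot job script (all of those names are made up for the example):

  # submit reboot jobs 4 at a time; sbatch --wait blocks until the job
  # finishes, and 'wait' holds the loop until the whole batch is done
  while mapfile -t -n 4 batch && (( ${#batch[@]} > 0 )); do
      for node in "${batch[@]}"; do
          sbatch --wait --nodelist="$node" reboot_node.sh &
      done
      wait
  done < nodes_in_order.txt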

Tina

On 04/08/2022 06:20, Gerhard Strangar wrote:
Phil Chiu wrote:

    - Individual slurm jobs which reboot nodes - With a for loop, I could
    submit a reboot job for each node. But I'm not sure how to limit this so at
    most N jobs are running simultaneously.

With a fake license called reboot?


--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk
