We have an 8-GPU server on which one GPU has gone into an error state that will require a reboot to clear. I have jobs running on the good GPUs of that server that will take another 3 days to complete. In the meantime, I would
like short jobs to be able to run on the free good GPUs until I reboot.

I set a reservation on the whole node, for the time window in which I plan to reboot, with

scontrol create reservation reservationName=rtx-01_reboot users=root \
  starttime=2024-11-25T06:00:00 duration=720 Nodes=rtx-01 \
  flags=maint,ignore_jobs

But I would also like to set a reservation on just the bad GPU (gpu_id=7), from now until 2024-11-25T06:00:00, so that no job that would use it runs.

Is that possible?

---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129            USA




--
slurm-users mailing list -- slurm-users@lists.schedmd.com
