We have an 8-GPU server on which one GPU has gone into an error state that
will require a reboot to clear. I have jobs running on the good GPUs of
that server that will take another 3 days to complete. In the meantime, I
would like short jobs to run on the good, free GPUs until I reboot.
I set a reservation on the whole node, covering the time window in which I
plan to reboot, with

  scontrol create reservation reservationName=rtx-01_reboot users=root \
      starttime=2024-11-25T06:00:00 duration=720 Nodes=rtx-01 \
      flags=maint,ignore_jobs

But I would also like to set a reservation on just the bad GPU (gpu_id=7),
from now until 2024-11-25T06:00:00, so that no job that would use it gets
to run. Is that possible?
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA