Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-03 Thread Paul Raines
On Thu, 3 Feb 2022 1:30am, Stephan Roth wrote: On 02.02.22 18:32, Michael Di Domenico wrote: On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote: The problem is to identify the cards physically from the information we have, like what's reported with nvidia-smi or available in /proc/drive…

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-02 Thread Stephan Roth
On 02.02.22 18:32, Michael Di Domenico wrote: On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote: The problem is to identify the cards physically from the information we have, like what's reported with nvidia-smi or available in /proc/driver/nvidia/gpus/*/information The serial number isn't sh…

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-02 Thread Michael Di Domenico
On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote: > The problem is to identify the cards physically from the information we > have, like what's reported with nvidia-smi or available in > /proc/driver/nvidia/gpus/*/information > The serial number isn't shown for every type of GPU and I'm not sure…
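The `/proc/driver/nvidia/gpus/*/information` files mentioned above are plain "Key: Value" text, so mapping PCI bus locations to whatever the driver reports can be scripted. A minimal sketch, assuming the typical field layout of those files (the `SAMPLE` content below is illustrative, not copied from the poster's hardware):

```python
import re
from pathlib import Path

def parse_gpu_info(text):
    """Parse the 'Key: Value' lines of a /proc/driver/nvidia/gpus/*/information file."""
    info = {}
    for line in text.splitlines():
        m = re.match(r"([^:]+):\s+(.*)", line)
        if m:
            info[m.group(1).strip()] = m.group(2).strip()
    return info

# Illustrative sample of the file format; the real files live under
# /proc/driver/nvidia/gpus/<pci-address>/information
SAMPLE = """\
Model:           Quadro RTX 8000
IRQ:             130
GPU UUID:        GPU-01234567-89ab-cdef-0123-456789abcdef
Bus Location:    0000:3b:00.0
Device Minor:    2
"""

def scan_gpus(root="/proc/driver/nvidia/gpus"):
    """Map PCI bus location -> parsed info for every GPU the driver exposes."""
    gpus = {}
    for f in Path(root).glob("*/information"):
        info = parse_gpu_info(f.read_text())
        gpus[info.get("Bus Location", f.parent.name)] = info
    return gpus

if __name__ == "__main__":
    info = parse_gpu_info(SAMPLE)
    print(info["Model"], info["Bus Location"])
```

As the thread notes, a serial number is not present in this file for every GPU type, so the UUID or bus location is often the only stable handle for matching a physical card.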

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-01 Thread Paul Raines
…, though, using CUDA_DEVICE_ORDER. See https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/ Cheers, Esben From: slurm-users on behalf of Timony, Mick Sent: Monday, January 31, 2022 15:45 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] How t…
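The `CUDA_DEVICE_ORDER` variable referenced above addresses the mismatch between CUDA's device numbering and the PCI order used by `nvidia-smi`. A minimal sketch of the relevant setting:

```shell
# By default CUDA enumerates devices "fastest first", which need not match
# the PCI bus order that nvidia-smi and /dev/nvidia* use. Forcing PCI order
# makes CUDA device IDs line up with the bus topology, so a card identified
# as bad in nvidia-smi maps to the same index inside CUDA jobs.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
```

This would typically go in a job's environment or a site-wide prolog so every CUDA job sees a consistent numbering.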

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-01-31 Thread Stephan Roth
Not a solution, but some ideas & experiences concerning the same topic: A few of our older GPUs used to show the error message "has fallen off the bus" which was only resolved by a full power cycle as well. Something changed; nowadays the error message is "GPU lost" and a normal reboot reso…

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-01-31 Thread Timony, Mick
I have a large compute node with 10 RTX8000 cards at a remote colo. One of the cards on it is acting up, "falling off the bus" once a day and requiring a full power cycle to reset. I want jobs to avoid that card as well as the card it is NVLINK'ed to. So I modified gres.conf on that node as follows: …

[slurm-users] How to tell SLURM to ignore specific GPUs

2022-01-30 Thread Paul Raines
I have a large compute node with 10 RTX8000 cards at a remote colo. One of the cards on it is acting up, "falling off the bus" once a day and requiring a full power cycle to reset. I want jobs to avoid that card as well as the card it is NVLINK'ed to. So I modified gres.conf on that node as follows: …
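The gres.conf change the preview cuts off could be sketched like this. Device paths, the GPU type string, and the node name below are hypothetical; the idea is to list only the healthy devices so Slurm never hands out the bad card or its NVLink peer:

```
# gres.conf sketch (hypothetical devices): suppose /dev/nvidia8 is the
# failing card and /dev/nvidia9 its NVLink peer. Listing only the healthy
# devices keeps Slurm from scheduling jobs on the bad pair.
Name=gpu Type=rtx8000 File=/dev/nvidia[0-7]

# slurm.conf must then advertise the reduced count on that node, e.g.:
# NodeName=gpunode01 Gres=gpu:rtx8000:8
```

Note that the gres.conf device list and the `Gres=` count in slurm.conf have to agree, or slurmd will reject the configuration on startup.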