On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote:
> The problem is to identify the cards physically from the information we
> have, like what's reported with nvidia-smi or available in
> /proc/driver/nvidia/gpus/*/information
> The serial number isn't shown for every type of GPU and I'm not sure ...

You can make the device numbering that CUDA sees follow the PCI bus
order reported by nvidia-smi, though, using CUDA_DEVICE_ORDER.
See https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/
Cheers,
Esben
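
A minimal sketch of how to check that ordering (it assumes only the CUDA
runtime API; the file name is made up and nothing here is from the thread
itself): it prints each CUDA device index next to its PCI bus ID so the
numbering can be compared with what nvidia-smi shows.

// list_gpus.cu - hypothetical helper, build with: nvcc -o list_gpus list_gpus.cu
// Prints each CUDA device index next to its PCI bus ID so the enumeration
// order can be compared with the output of nvidia-smi.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;
        // pciDomainID/pciBusID/pciDeviceID correspond to the bus ID column
        // that nvidia-smi prints for each GPU.
        std::printf("CUDA device %d: %s  PCI %04x:%02x:%02x.0\n",
                    i, prop.name, prop.pciDomainID, prop.pciBusID,
                    prop.pciDeviceID);
    }
    return 0;
}

Running it once as-is and once with CUDA_DEVICE_ORDER=PCI_BUS_ID set shows
whether the CUDA numbering already follows the bus order on a given node.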
From: slurm-users on behalf of Timony, Mick
Sent: Monday, January 31, 2022 15:45
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] How t...
Not a solution, but some ideas & experiences concerning the same topic:
A few of our older GPUs used to show the error message "has fallen off
the bus", which could only be resolved by a full power cycle as well.
Something changed; nowadays the error message is "GPU lost" and a
normal reboot resolves it.
I have a large compute node with 10 RTX8000 cards at a remote colo.
One of the cards on it is acting up, "falling off the bus" once a day
and requiring a full power cycle to reset.
I want jobs to avoid that card as well as the card it is NVLINK'ed to.
So I modified gres.conf on that node as follows:
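
A purely hypothetical sketch of that kind of change (the node name, GPU
type, and device numbers are made up, not taken from this node): list only
the healthy devices explicitly in gres.conf and lower the node's Gres=
count in slurm.conf to match, e.g.

# gres.conf - hypothetical example, assuming the bad card and its NVLink
# peer appear as /dev/nvidia6 and /dev/nvidia7
NodeName=gpu-node01 Name=gpu Type=rtx8000 File=/dev/nvidia[0-5]
NodeName=gpu-node01 Name=gpu Type=rtx8000 File=/dev/nvidia[8-9]

# slurm.conf entry for the same node then advertises 8 GPUs instead of 10:
# NodeName=gpu-node01 ... Gres=gpu:rtx8000:8

Jobs scheduled on the node then simply never see the two excluded device
files.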