Hi Rob,
Yes, those questions make sense. From what I understand, MIG should
essentially split a GPU so that the instances behave as separate cards.
Hence, two different users should be able to use two different MIG
instances at the same time, and a single job could also use all 14
instances.
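Concretely, that expected behavior could be exercised like this. This is only a sketch: the gres name `1g.5gb` is a guess, and depends on how gres.conf actually advertises the MIG profiles on your cluster.

```shell
# Each user asks for one MIG slice; Slurm can hand out two slices of the
# same physical GPU to different jobs at the same time.
srun --gres=gpu:1g.5gb:1 --pty nvidia-smi -L

# And a single job could request all 14 slices on the node.
srun --gres=gpu:1g.5gb:14 --pty nvidia-smi -L
```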
We’re seeing a recurring issue where long-running node allocations
eventually stop accepting SSH connections.
Our cluster configures the pam_slurm_adopt module in order to let users
SSH into nodes they have allocated. However, even if the allocated node is
idle, after around 24 hours SSH connections to it start being refused.
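For reference, a minimal sketch of the relevant sshd PAM entry (the exact file path, module order, and option choices here are assumptions and vary by distro):

```
# /etc/pam.d/sshd (excerpt; a sketch, distro layouts differ)
# pam_slurm_adopt adopts an incoming SSH session into the user's running
# job on this node; action_no_jobs=deny rejects users with no job here.
account    required    pam_slurm_adopt.so action_no_jobs=deny
```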
We have successfully used the nvidia-smi tool to take the two A100s in a
node and split them into multiple GPU devices. In one case we split the
two GPUs into 7 MIG devices each, so 14 in that node total, and in the
other case we split the two GPUs into 2 MIG devices each, so 4 total in
the node.
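For the record, a sketch of the commands involved. The profile IDs 19 and 9 assume 40 GB A100s; `nvidia-smi mig -lgip` lists the valid profiles for your cards.

```shell
# Enable MIG mode on GPU 0 (repeat for GPU 1); may require a GPU reset.
nvidia-smi -i 0 -mig 1

# 7-way split: create seven 1g.5gb GPU instances on GPU 0, each with a
# compute instance (-C). Profile ID 19 = 1g.5gb on an A100-40GB.
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# 2-way split alternative: two 3g.20gb instances (profile ID 9).
# nvidia-smi mig -i 0 -cgi 9,9 -C

# Verify: list the resulting MIG devices and their UUIDs.
nvidia-smi -L
```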