Hi Rob,
Yes, those questions make sense. From what I understand, MIG should
essentially split a GPU so that the instances behave as separate cards.
Hence, two different users should be able to use two different MIG
instances at the same time, and a single job could also use all 14
instances.
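Concretely, that expected behavior could be exercised like this. This is only a sketch: the gres name `1g.5gb` is a guess, and depends on how gres.conf actually advertises the MIG profiles on your cluster.

```shell
# Each user asks for one MIG slice; Slurm can hand out two slices of the
# same physical GPU to different jobs at the same time.
srun --gres=gpu:1g.5gb:1 --pty nvidia-smi -L

# And a single job could request all 14 slices on the node.
srun --gres=gpu:1g.5gb:14 --pty nvidia-smi -L
```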
We’re seeing a recurring issue where long-running node allocations
eventually stop accepting SSH connections.
Our cluster configures the pam_slurm_adopt module in order to let users
SSH into nodes they have allocated. However, even if the allocated node is
idle, after around 24 hours SSH connections to it start being refused.
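For reference, a minimal sketch of the relevant sshd PAM entry (the exact file path, module order, and option choices here are assumptions and vary by distro):

```
# /etc/pam.d/sshd (excerpt; a sketch, distro layouts differ)
# pam_slurm_adopt adopts an incoming SSH session into the user's running
# job on this node; action_no_jobs=deny rejects users with no job here.
account    required    pam_slurm_adopt.so action_no_jobs=deny
```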
We have successfully used the nvidia-smi tool to take the two A100s in a
node and split them into multiple GPU devices. In one case we split the
two GPUs into 7 MIG devices each, so 14 in that node total, and in the
other case we split the two GPUs into 2 MIG devices each, so 4 total in
the node.
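For the record, a sketch of the commands involved. The profile IDs 19 and 9 assume 40 GB A100s; `nvidia-smi mig -lgip` lists the valid profiles for your cards.

```shell
# Enable MIG mode on GPU 0 (repeat for GPU 1); may require a GPU reset.
nvidia-smi -i 0 -mig 1

# 7-way split: create seven 1g.5gb GPU instances on GPU 0, each with a
# compute instance (-C). Profile ID 19 = 1g.5gb on an A100-40GB.
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# 2-way split alternative: two 3g.20gb instances (profile ID 9).
# nvidia-smi mig -i 0 -cgi 9,9 -C

# Verify: list the resulting MIG devices and their UUIDs.
nvidia-smi -L
```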