We are running into a problem where slurmctld is segfaulting a few
times a day. We had this problem with SLURM 23.11.8 and we still see it with
23.11.10. The problem only appears on one of the several SLURM clusters we
have, even though all of them run one of those two versions. I was wondering
whether anyone else has run into this, or has suggestions for tracking it down.
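If a backtrace would help, this is roughly how one can be pulled from a
slurmctld core dump. This is just a sketch: it assumes systemd-coredump is
capturing the crashes and that slurmctld was built with debug symbols, neither
of which is something stated above.

    # confirm systemd-coredump caught the crash and show basic metadata
    coredumpctl list slurmctld
    coredumpctl info slurmctld

    # open the most recent core in gdb and dump backtraces for all threads
    coredumpctl gdb slurmctld
    (gdb) thread apply all bt full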
We have a node with 8 H100 GPUs that are split into MIG instances, and we are
using cgroups. This seems to work fine. Users can do something like

    sbatch --gres="gpu:1g.10gb:1" ...

and the job starts on the GPU node, CUDA_VISIBLE_DEVICES is set, and the
PyTorch debug output shows that the cgroup only gives the job access to the
MIG instance it requested.
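For context, the relevant pieces of the setup look roughly like the sketch
below. It follows the documented way of exposing MIG instances as GRES via
NVML autodetection and confining jobs with cgroup device constraints; the node
name and the instance count are placeholders, not our exact values, so treat
the lines as an illustration rather than a copy of our config files.

    # gres.conf on the GPU node: let slurmd enumerate the MIG instances via NVML
    AutoDetect=nvml

    # slurm.conf: advertise the MIG profile as a GPU type (count is a placeholder)
    GresTypes=gpu
    NodeName=gpunode01 Gres=gpu:1g.10gb:56

    # cgroup.conf: device confinement is what restricts a job to its MIG instance
    ConstrainDevices=yes

A minimal job script to check what a job actually sees inside its cgroup:

    #!/bin/bash
    #SBATCH --gres=gpu:1g.10gb:1
    # print the device(s) the job was granted
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    nvidia-smi -L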