We are running into a problem where slurmctld is segfaulting a few
times a day. We had this problem with SLURM 23.11.8 and we still see it with
23.11.10. The problem only appears on one of the several SLURM clusters we
have, even though all of them run one of those two versions. I was wondering
whether anyone else has run into this, or has suggestions for tracking it down.
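If a backtrace would help, this is roughly how one can be pulled from a
slurmctld core dump. This is just a sketch: it assumes systemd-coredump is
capturing the crashes and that slurmctld was built with debug symbols, neither
of which is something stated above.

    # confirm systemd-coredump caught the crash and show basic metadata
    coredumpctl list slurmctld
    coredumpctl info slurmctld

    # open the most recent core in gdb and dump backtraces for all threads
    coredumpctl gdb slurmctld
    (gdb) thread apply all bt full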
We have a node with 8 H100 GPUs that are split into MIG instances, and we are
using cgroups. This seems to work fine. Users can do something like

    sbatch --gres="gpu:1g.10gb:1" ...

and the job starts on the GPU node, CUDA_VISIBLE_DEVICES is set, and the
PyTorch debug output shows that the cgroup only gives the job access to the
MIG instance it requested.
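For context, the relevant pieces of the setup look roughly like the sketch
below. It follows the documented way of exposing MIG instances as GRES via
NVML autodetection and confining jobs with cgroup device constraints; the node
name and the instance count are placeholders, not our exact values, so treat
the lines as an illustration rather than a copy of our config files.

    # gres.conf on the GPU node: let slurmd enumerate the MIG instances via NVML
    AutoDetect=nvml

    # slurm.conf: advertise the MIG profile as a GPU type (count is a placeholder)
    GresTypes=gpu
    NodeName=gpunode01 Gres=gpu:1g.10gb:56

    # cgroup.conf: device confinement is what restricts a job to its MIG instance
    ConstrainDevices=yes

A minimal job script to check what a job actually sees inside its cgroup:

    #!/bin/bash
    #SBATCH --gres=gpu:1g.10gb:1
    # print the device(s) the job was granted
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    nvidia-smi -L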