[slurm-users] Re: Slurm and NVIDIA NVML

2024-11-13 Thread Joshua Randall via slurm-users
Hi Matthias, just another user here, but we did notice similar behaviour on our cluster with NVIDIA GPU nodes. For this cluster, we built Slurm 24.05.1 deb packages from source ourselves on Ubuntu 22.04, with the `libnvidia-ml-dev` package installed directly from the Ubuntu package archive (using t…
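For context, the usual reason to have `libnvidia-ml-dev` present at build time is so that slurmd is compiled with NVML support and can enumerate the GPUs itself. A minimal sketch of the configuration that relies on that support, assuming NVIDIA GPU nodes (the node name and GPU count below are placeholders, not something taken from this thread):

    # slurm.conf excerpt (hypothetical node definition)
    GresTypes=gpu
    NodeName=gpu[01-04] Gres=gpu:4 State=UNKNOWN

    # gres.conf on each GPU node: let slurmd discover the GPUs through NVML
    AutoDetect=nvml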

[slurm-users] Re: problem with squeue --json with version 24.05.1

2024-07-03 Thread Joshua Randall via slurm-users
Markus, I had a similar problem after upgrading from v23 to v24, but found that specifying _any_ valid data version worked for me; it was only specifying `--json` without a version that triggered an error (which in my case was, I believe, a segfault from sinfo rather than a malloc error from squeue -…
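For anyone finding this later, a small sketch of the workaround described above: call `squeue --json` with an explicit data_parser version rather than bare `--json`. This assumes a 24.05 installation where the `v0.0.40` data_parser plugin is available; the JSON field names used below are illustrative and worth checking against your own output.

    import json
    import subprocess

    # Workaround sketch: pass an explicit data_parser version to --json
    # instead of bare `--json`, which was what triggered the error here.
    result = subprocess.run(
        ["squeue", "--json=v0.0.40"],  # v0.0.40 assumed present on 24.05
        check=True,
        capture_output=True,
        text=True,
    )

    data = json.loads(result.stdout)
    # squeue's JSON output carries one record per job under the "jobs" key.
    for job in data.get("jobs", []):
        print(job.get("job_id"), job.get("job_state"))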

[slurm-users] Re: Container Jobs "hanging"

2024-05-31 Thread Joshua Randall via slurm-users
Just an update to say that this issue for me appears to be specific to the `runc` runtime (or `nvidia-container-runtime` when it uses `runc` internally). I switched to using `crun` and the problem went away -- containers run using `srun --container` now terminate after the inner process terminates.
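For comparison, a sketch of a crun-based oci.conf along the lines of the switch described above, closely following the example in the Slurm containers documentation; the rootless setup and the /run/user root path are assumptions that may not match every site:

    # oci.conf: drive OCI container jobs through crun rather than runc
    RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
    RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
    RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
    RunTimeRun="crun --rootless=true --root=/run/user/%U/ run --bundle %b %n.%u.%j.%s.%t"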

[slurm-users] Re: Container Jobs "hanging"

2024-05-28 Thread Joshua Randall via slurm-users
Hi Sean, I appear to be having the same issue you are, with OCI container jobs running forever / appearing to hang. I haven't figured it out yet, but perhaps we can compare notes and determine what aspect of configuration we both share. Like you, I was following the examples in https:/…