Hi Matthias,
Just another user here, but we did notice similar behaviour on our cluster
with NVIDIA GPU nodes. For this cluster, we built slurm 24.05.1 deb
packages from source ourselves on Ubuntu 22.04 with the `libnvidia-ml-dev`
package installed directly from the Ubuntu package archive (using
t
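
In case it helps with comparison, the build itself followed the Debian
packaging route that ships with recent Slurm releases (a sketch of the
steps, assuming the upstream debian/ packaging in the 24.05 tarball;
the package names are from Ubuntu 22.04):

    # build tooling plus the NVML headers Slurm links against
    apt-get install build-essential fakeroot devscripts equivs libnvidia-ml-dev
    tar -xaf slurm-24.05.1.tar.bz2
    cd slurm-24.05.1
    # pull in the build dependencies declared in debian/control
    mk-build-deps -i debian/control
    # produce unsigned binary .deb packages in the parent directory
    debuild -b -uc -us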
Markus,
I had a similar problem after upgrading from v23 to v24 but found that
specifying _any_ valid data version worked for me; it was only
specifying `--json` without a version that triggered an error (which
in my case was, I believe, a segfault from sinfo rather than a malloc
error from squeue -
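
For comparison, the contrast on our 24.05 install looks roughly like
this (v0.0.41 is the data_parser version shipped with 24.05; adjust to
whatever plugin versions your build provides):

    # triggers the crash here (segfault from sinfo in my case):
    sinfo --json
    # works once any valid data_parser version is given explicitly:
    sinfo --json=v0.0.41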
Just an update to say that this issue for me appears to be specific to
the `runc` runtime (or `nvidia-container-runtime` when it uses `runc`
internally). I switched to using `crun` and the problem went away --
containers launched with `srun --container` now exit once the inner
process terminates.
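
For anyone wanting to try the same switch, the change is confined to
oci.conf. Ours now follows the rootless crun example from the Slurm
containers documentation fairly closely (a sketch; the --rootless and
--root settings will depend on your site):

    EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
    RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
    RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
    RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
    RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
    RunTimeRun="crun --rootless=true --root=/run/user/%U/ run --bundle %b %n.%u.%j.%s.%t"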
Hi Sean,
I appear to be having the same issue that you are having with OCI
container jobs running forever / appearing to hang. I haven't figured
it out yet, but perhaps we can compare notes and determine which
aspects of our configurations we share.
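
On my side a minimal reproducer looks like this (the bundle path is
just a placeholder for wherever your OCI bundle lives):

    # the inner command exits immediately, but the step never does:
    srun --container=/path/to/oci-bundle true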
Like you, I was following the examples in
https:/