Just an update to say that, for me, this issue appears to be specific to the `runc` runtime (or to `nvidia-container-runtime` when it uses `runc` internally). I switched to `crun` and the problem went away: containers run with `srun --container` now terminate once the inner process exits.
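For anyone wanting to try the same workaround, a minimal oci.conf sketch for `crun` follows, assuming the `--rootless`/`--root` flags and subcommands carry over from the `nvidia-container-runtime` config quoted below (check `man crun` and the examples in https://slurm.schedmd.com/containers.html against your installed version before using):

```
$ cat /etc/slurm/oci.conf
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="crun --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
```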
--
Dr. Joshua C. Randall
Principal Software Engineer
Altos Labs
email: jrand...@altoslabs.com

On Tue, May 28, 2024 at 2:18 PM Joshua Randall <jrand...@altoslabs.com> wrote:
>
> Hi Sean,
>
> I appear to be having the same issue that you are having with OCI
> container jobs running forever / appearing to hang. I haven't figured
> it out yet, but perhaps we can compare notes and determine which aspects
> of configuration we share.
>
> Like you, I was following the examples in
> https://slurm.schedmd.com/containers.html and originally encountered
> the issue with an alpine container image running the `uptime` command,
> but I have also confirmed the issue with other images, including ubuntu,
> and with other processes. I always get the same result: the
> container process runs to completion and exits, but the Slurm job
> then continues to run until it is cancelled or killed.
>
> I have Slurm v23.11.6 and am using nvidia-container-runtime; which
> Slurm version and runtime are you using?
>
> My oci.conf is:
> ```
> $ cat /etc/slurm/oci.conf
> EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
> RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
> RunTimeQuery="nvidia-container-runtime --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
> RunTimeKill="nvidia-container-runtime --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
> RunTimeDelete="nvidia-container-runtime --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
> RunTimeRun="nvidia-container-runtime --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
> ```
>
> I hope we can get to the bottom of this and resolve our issues with
> OCI containers!
>
> Josh.
>
>
> ---
> Hello. I am new to this list and to Slurm overall. I have a lot of
> experience in computer operations, including Kubernetes, but I am
> currently exploring Slurm in some depth.
>
> I have set up a small cluster and, in general, have gotten things
> working, but when I try to run a container job, it runs the command
> and then appears to hang, as if the job container were still running.
>
> So, running the following works, but it never returns to the prompt
> unless I use [Control-C]:
>
> $ srun --container /shared_fs/shared/oci_images/alpine uptime
> 19:21:47 up 20:43, 0 users, load average: 0.03, 0.25, 0.15
>
> I'm unsure whether something is misconfigured or I'm misunderstanding
> how this should work, but any help and/or pointers would be greatly
> appreciated.
>
> Thanks!
> Sean
>
> --
> slurm-users mailing list -- slurm...@lists.schedmd.com
> To unsubscribe send an email to slurm-us...@lists.schedmd.com

--
Altos Labs UK Limited | England | Company reg 13484917
Registered address: 3rd Floor 1 Ashley Road, Altrincham, Cheshire, United Kingdom, WA14 2DT

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com