Just an update to say that, for me, this issue appears to be specific to the `runc` runtime (or to `nvidia-container-runtime` when it uses `runc` internally). I switched to `crun` and the problem went away: containers run with `srun --container` now terminate once the inner process exits.
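For anyone wanting to try the same workaround, a minimal oci.conf sketch for `crun` follows, assuming the `--rootless`/`--root` flags and subcommands carry over from the `nvidia-container-runtime` config quoted below (check `man crun` and the examples in https://slurm.schedmd.com/containers.html against your installed version before using):

```
$ cat /etc/slurm/oci.conf
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="crun --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
```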
--
Dr. Joshua C. Randall
Principal Software Engineer
Altos Labs
email: jrand...@altoslabs.com

On Tue, May 28, 2024 at 2:18 PM Joshua Randall <jrand...@altoslabs.com> wrote:
>
> Hi Sean,
>
> I appear to be having the same issue that you are having with OCI
> container jobs running forever / appearing to hang. I haven't figured
> it out yet, but perhaps we can compare notes and determine which aspects
> of configuration we share.
>
> Like you, I was following the examples in
> https://slurm.schedmd.com/containers.html and originally encountered
> the issue with an alpine container image running the `uptime` command,
> but I have also confirmed the issue with other images, including ubuntu,
> and with other processes. I always get the same result: the
> container process runs to completion and exits, but the Slurm job
> then continues to run until it is cancelled or killed.
>
> I have Slurm v23.11.6 and am using nvidia-container-runtime; which
> Slurm version and runtime are you using?
>
> My oci.conf is:
> ```
> $ cat /etc/slurm/oci.conf
> EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
> RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
> RunTimeQuery="nvidia-container-runtime --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
> RunTimeKill="nvidia-container-runtime --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
> RunTimeDelete="nvidia-container-runtime --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
> RunTimeRun="nvidia-container-runtime --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"
> ```
>
> I hope we can get to the bottom of this and resolve our issues with
> OCI containers!
>
> Josh.
>
>
> ---
> Hello. I am new to this list and to Slurm overall. I have a lot of
> experience in computer operations, including Kubernetes, but I am
> currently exploring Slurm in some depth.
>
> I have set up a small cluster and, in general, have gotten things
> working, but when I try to run a container job, it runs the command
> and then appears to hang, as if the job container were still running.
>
> So, running the following works, but it never returns to the prompt
> unless I use [Control-C]:
>
> $ srun --container /shared_fs/shared/oci_images/alpine uptime
> 19:21:47 up 20:43, 0 users, load average: 0.03, 0.25, 0.15
>
> I'm unsure whether something is misconfigured or I'm misunderstanding
> how this should work, but any help and/or pointers would be greatly
> appreciated.
>
> Thanks!
> Sean
>
> --
> slurm-users mailing list -- slurm...@lists.schedmd.com
> To unsubscribe send an email to slurm-us...@lists.schedmd.com

--
Altos Labs UK Limited | England | Company reg 13484917
Registered address: 3rd Floor 1 Ashley Road, Altrincham, Cheshire, United Kingdom, WA14 2DT

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com