Re: [slurm-users] slurm and singularity

2023-02-07 Thread Markus Kötter
Hi, On 08.02.23 05:00, Carl Ponder wrote: Take a look at this extension to SLURM: https://github.com/NVIDIA/pyxis https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf enroot & pyxis - great recommendation for rootless containerized runtime environments in HPC. Free software, no lic
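For readers unfamiliar with enroot, a minimal sketch of the rootless workflow it provides, assuming enroot is installed on the node; the Ubuntu image and the container name are only placeholders:

    enroot import docker://ubuntu:22.04          # fetches the image into ubuntu+22.04.sqsh
    enroot create --name mycontainer ubuntu+22.04.sqsh
    enroot start mycontainer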

Re: [slurm-users] slurm and singularity

2023-02-07 Thread Carl Ponder
Take a look at this extension to SLURM: https://github.com/NVIDIA/pyxis You put the container path on the srun command-line and each rank runs inside its own copy of the image. Subject: [slurm-users] slurm an
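A minimal sketch of what that srun invocation can look like once pyxis is installed; the image reference and the bind mount are only examples:

    srun --container-image=ubuntu:22.04 \
         --container-mounts=/home:/home \
         grep PRETTY_NAME /etc/os-release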

Re: [slurm-users] slurm and singularity

2023-02-07 Thread Jeffrey T Frey
> The remaining issue then is how to put them into an allocation that is > actually running a singularity container. I don't get how what I'm doing now > is resulting in an allocation where I'm in a container on the submit node > still! Try prefixing the singularity command with "srun" e.g.
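A sketch of what that prefixing looks like, with the image path as a placeholder; srun turns the singularity command into a job step, so the container runs on the allocated compute node rather than on the submit node:

    # inside a batch script (or after salloc):
    srun singularity exec /path/to/image.sif ./script_to_run.sh

    # interactive shell inside the container on the allocated node:
    salloc -N1 -n1
    srun --pty singularity shell /path/to/image.sif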

Re: [slurm-users] slurm and singularity

2023-02-07 Thread Groner, Rob
Looks like we can go the route of a wrapper script, since our users don't specifically need to know they're running an sbatch. Thanks for the suggestion. The remaining issue then is how to put them into an allocation that is actually running a singularity container. I don't get how what I'm do

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Analabha Roy
Howdy, On Tue, 7 Feb 2023 at 20:18, Sean Mc Grath wrote: > Hi Analabha, > > Yes, unfortunately for your needs, I expect a time limited reservation > along my suggestion would not accept jobs that would be scheduled to end > outside of the reservations availability times. I'd suggest looking at >

Re: [slurm-users] slurm and singularity

2023-02-07 Thread Brian Andrus
You should have the job script itself contain the singularity/apptainer command. I am guessing you don't want your users to have to deal with that part for their scripts, so I would suggest using a wrapper script. You could just have them run something like: cluster_run.sh Then cluster_run.s
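A rough sketch of such a wrapper, assuming the name cluster_run.sh and the container path are hypothetical and site-specific; it generates a batch script on the fly and submits it via stdin:

    #!/bin/bash
    # cluster_run.sh -- hypothetical wrapper: users pass their script and the
    # wrapper submits it so it runs inside the container on a compute node.
    user_script="$1"
    sbatch <<EOF
    #!/bin/bash
    #SBATCH --ntasks=1
    srun singularity exec /opt/containers/default.sif bash "$user_script"
    EOF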

[slurm-users] slurm and singularity

2023-02-07 Thread Groner, Rob
I'm trying to set up the capability where a user can execute: $: sbatch script_to_run.sh and the end result is that a job is created on a node, and that job will execute "singularity exec script_to_run.sh" Also, that they could execute: $: salloc and would end up on a node per their paramet

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Sean Mc Grath
Hi Analabha, Yes, unfortunately for your needs, I expect a time limited reservation along my suggestion would not accept jobs that would be scheduled to end outside of the reservations availability times. I'd suggest looking at check-pointing in this case, e.g. with DMTCP: Distributed MultiThre
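A rough sketch of how DMTCP checkpointing could fit in here, assuming DMTCP is installed and ./myprog stands in for the real workload; dmtcp_launch writes periodic checkpoint images plus a restart script into the working directory:

    # first run: checkpoint every hour
    dmtcp_launch --interval 3600 ./myprog

    # a later job can resume from the generated restart script
    ./dmtcp_restart_script.sh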

[slurm-users] job_res_rm_job: plugin still initializing

2023-02-07 Thread Loris Bennett
Hi, The other day we updated to 22.05.8. We are interested in using sharding with our GPUs, so after the update had finished, we changed SelectType=select/cons_res to SelectType=select/cons_tres. This seemed to cause the slurmctld to lose contact with the slurmstepds, so that a large numb
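For reference, a sketch of the sharding-related configuration being described, with placeholder node names and counts (the incident above concerns the cons_res -> cons_tres switch itself, not these exact lines):

    # slurm.conf
    SelectType=select/cons_tres
    GresTypes=gpu,shard
    NodeName=gpu01 Gres=gpu:1,shard:4 ...

    # gres.conf on the GPU node
    Name=gpu File=/dev/nvidia0
    Name=shard Count=4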

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Diego Zuccato
That's probably not optimal, but could work. I'd go with brutal preemption: swapping 90+G can be quite time-consuming. Diego On 07/02/2023 14:18, Analabha Roy wrote: On Tue, 7 Feb 2023, 18:12 Diego Zuccato wrote: > RAM used by a suspended job is not

Re: [slurm-users] Debian dist-upgrade?

2023-02-07 Thread Steffen Grunewald
Hi Loris, On Tue, 2023-01-24 at 16:48:26 +0100, Loris Bennett wrote: > Hi Steffen, > > Could you create/find a deb-package for a Slurm 19.x version to use as > an intermediate? Never having built such a package, I don't know how > much trouble that would be. Actually, I found a 20.02.6 version t

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Analabha Roy
On Tue, 7 Feb 2023, 18:12 Diego Zuccato wrote: > RAM used by a suspended job is not released. At most it can be swapped > out (if enough swap is available). > There should be enough swap available. I have 93 GB of RAM and a swap partition just as big. I can top it off with swap files if needed.

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Diego Zuccato
RAM used by a suspended job is not released. At most it can be swapped out (if enough swap is available). On 07/02/2023 13:14, Analabha Roy wrote: Hi Sean, Thanks for your awesome suggestion! I'm going through the reservation docs now. At first glance, it seems like a daily reservation

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Analabha Roy
Hi Sean, Thanks for your awesome suggestion! I'm going through the reservation docs now. At first glance, it seems like a daily reservation would turn down jobs that are too big for the reservation. It'd be nice if slurm could suspend (in the manner of 'scontrol suspend') jobs during reserved down
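A sketch of how that kind of suspend/resume could be driven externally (e.g. from cron around the downtime window), assuming it runs as root or the SlurmUser:

    # suspend every running job at the start of the window
    squeue -h -t R -o %A | xargs -r -n1 scontrol suspend

    # resume the suspended jobs afterwards
    squeue -h -t S -o %A | xargs -r -n1 scontrol resume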

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Sean Mc Grath
Hi Analabha, Could you do something like create a daily reservation for 8 hours that starts at 9am, or whatever times work for you, with something like the following untested command: scontrol create reservation starttime=09:00:00 duration=8:00:00 nodecnt=1 flags=daily ReservationName=daily Daily option at ht
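The same untested command, laid out for readability with a follow-up check; the times and node count are placeholders:

    scontrol create reservation starttime=09:00:00 duration=8:00:00 \
        nodecnt=1 flags=daily ReservationName=daily
    scontrol show reservation daily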

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Analabha Roy
Hi, Thanks. I had read the Slurm Power Saving Guide before. I believe the configs enable slurmctld to check other nodes for idleness and suspend/resume them. Slurmctld must run on a separate, always-on server for this to work, right? My issue might be a little different. I literally have only one
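A sketch of the power-saving settings that guide covers, with placeholder paths and values; the suspend/resume scripts themselves are site-specific:

    # slurm.conf
    SuspendProgram=/usr/local/sbin/node_suspend.sh
    ResumeProgram=/usr/local/sbin/node_resume.sh
    SuspendTime=600            # seconds idle before a node is powered down
    SuspendTimeout=60
    ResumeTimeout=300
    SuspendExcNodes=head01     # never power down the controller/login node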