[slurm-users] Suggestions for Partition/QoS configuration

2024-04-04 Thread thomas.hartmann--- via slurm-users
Hi, we're testing possible Slurm configurations on a test system right now. Eventually, it is going to serve ~1000 users. We're going to have some users who will run lots of short jobs (a couple of minutes to ~4h) and some users whose jobs will run for days or weeks. I w…
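A minimal sketch of a QOS-based alternative for that mix of job lengths, assuming accounting is enabled; the QOS names and limits are illustrative, not from the thread:

    # Hypothetical QOS-based alternative: one partition, two wall-time classes.
    sacctmgr add qos short
    sacctmgr modify qos short set MaxWall=04:00:00
    sacctmgr add qos long
    sacctmgr modify qos long set MaxWall=14-00:00:00
    # Users then pick a class at submit time:
    #   sbatch --qos=short job.sh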

[slurm-users] Re: How to reinstall / reconfigure Slurm?

2024-04-04 Thread Shooktija S N via slurm-users
Thank you for the response, it certainly clears up a few things, and the list of required packages is super helpful (where are these listed in the docs?). Here are a few follow-up questions: I had installed Slurm (version 22.05) using apt by running 'apt install slurm-wlm'. Is it necessary to exe…
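A minimal sketch of moving from the distro packages to a source build, assuming the apt-installed 22.05 should be removed first; the version number and install prefix are illustrative:

    # Remove the distribution packages and their configs.
    sudo apt-get remove --purge slurm-wlm slurmctld slurmd slurmdbd
    # Build and install from a source tarball (illustrative version/prefix).
    tar xjf slurm-23.11.5.tar.bz2 && cd slurm-23.11.5
    ./configure --prefix=/usr/local --sysconfdir=/etc/slurm
    make -j"$(nproc)" && sudo make install
    # Pure configuration changes do not need a reinstall; after editing
    # slurm.conf, a reconfigure is enough:
    sudo scontrol reconfigure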

[slurm-users] Re: Suggestions for Partition/QoS configuration

2024-04-04 Thread Loris Bennett via slurm-users
Hi Thomas, "thomas.hartmann--- via slurm-users" writes: > Hi, > we're testing possible slurm configurations on a test system right now. > Eventually, it is going to serve ~1000 users. > > We're going to have some users who are going to run lots of short jobs > (a couple of minutes to ~4h) and s

[slurm-users] Re: scrun: Failed to run the container due to GID mapping configuration

2024-04-04 Thread Markus Kötter via slurm-users
Hi, On 04.04.24 04:46, Toshiki Sonoda (Fujitsu) via slurm-users wrote: > We set up scrun (slurm 23.11.5) integrated with rootless podman. I'd recommend looking into NVIDIA enroot instead: https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf Best regards -- Markus Kötter, +49 681 870832
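A minimal sketch of the enroot route suggested above, assuming enroot (and optionally the pyxis SPANK plugin) is installed; the image name is illustrative:

    # Import an OCI image and run it unprivileged with enroot.
    enroot import docker://ubuntu:22.04          # writes ubuntu+22.04.sqsh
    enroot create --name ubuntu ubuntu+22.04.sqsh
    enroot start ubuntu
    # With pyxis, a container can be requested directly at submit time:
    srun --container-image=ubuntu:22.04 cat /etc/os-release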

[slurm-users] Re: Slurm 23.11 - Unknown system variable 'wsrep_on'

2024-04-04 Thread Russell Jones via slurm-users
Thanks! I realized I made a mistake and had it still talking to an older slurmdbd system. On Wed, Apr 3, 2024 at 1:54 PM Timo Rothenpieler via slurm-users <slurm-users@lists.schedmd.com> wrote: > On 02.04.2024 22:15, Russell Jones via slurm-users wrote: > > Hi all, > > > > I am working on upgrad…

[slurm-users] SLURM configuration help

2024-04-04 Thread Alison Peterson via slurm-users
I am writing to seek assistance with a critical issue on our single-node system managed by Slurm. Our jobs are queued and marked as awaiting resources, but they are not starting despite resources appearing to be available. I'm new to Slurm and my only experience was a class on installing it, so I have no experie…

[slurm-users] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
What do “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID” show? On one job we currently have that’s pending due to Resources, that job has requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but the node it wants to run on only has 37 CPUs available (seen by…
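The commands asked for above, as a short diagnostic pass with a placeholder job ID:

    # Node view: compare CfgTRES (configured) against AllocTRES (in use).
    scontrol show node cusco
    # Job view: ReqTRES shows what the pending job asked for, and the
    # Reason field shows why it is still pending.
    scontrol show job PENDING_JOB_ID
    # Quick overview of pending reasons across the queue:
    squeue --states=PD -o "%.10i %.9P %.20j %.8u %.12r"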

[slurm-users] Re: [EXT] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
Yep, from your scontrol show node output:
CfgTRES=cpu=64,mem=2052077M,billing=64
AllocTRES=cpu=1,mem=2052077M
The running job (77) has allocated 1 CPU and all the memory on the node. That’s probably due to the partition using the default DefMemPerCPU value [1], which is unlimited. Since all ou…
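A minimal sketch of capping the per-CPU default so a 1-CPU job no longer claims all of the node's memory; the 4096 MB figure is an illustrative assumption:

    # slurm.conf: give jobs a finite default memory per allocated CPU.
    # This can also be set per partition on the PartitionName= line.
    DefMemPerCPU=4096
    # Push the change out to the daemons:
    #   scontrol reconfigure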

[slurm-users] Re: Suggestions for Partition/QoS configuration

2024-04-04 Thread Jerome Verleyen via slurm-users
On 04/04/2024 at 03:33, Loris Bennett via slurm-users wrote: I have never really understood the approach of having different partitions for different lengths of job, but it seems to be quite widespread, so I assume there are valid use cases. However, for our around 450 users, of which about 20…

[slurm-users] Re: Suggestions for Partition/QoS configuration

2024-04-04 Thread Gerhard Strangar via slurm-users
thomas.hartmann--- via slurm-users wrote:
> My idea was to basically have three partitions:
> 1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99] PriorityTier=100
> 2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50] PriorityTier=100
> 3. PartitionNam…
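An annotated sketch of that kind of overlapping layout; the third partition's name, node range and tier are hypothetical, since the quoted message is cut off:

    # Overlapping partitions: short jobs may run anywhere, long jobs only on
    # a subset, so part of the cluster always turns over within 4 hours.
    PartitionName=short     Nodes=node[01-99] MaxTime=04:00:00    PriorityTier=100 State=UP
    PartitionName=long_safe Nodes=node[01-50] MaxTime=14-00:00:00 PriorityTier=100 State=UP
    # Hypothetical third partition covering the remaining nodes at a lower
    # tier, e.g. for preemptable long jobs:
    PartitionName=long_low  Nodes=node[51-99] MaxTime=14-00:00:00 PriorityTier=50  State=UP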

[slurm-users] Re: [EXT] Re: [EXT] Re: SLURM configuration help

2024-04-04 Thread Alison Peterson via slurm-users
Thank you! That was the issue, I'm so happy :-) Sending you many thanks. On Thu, Apr 4, 2024 at 10:11 AM Renfro, Michael wrote: > Yep, from your scontrol show node output: > CfgTRES=cpu=64,mem=2052077M,billing=64 > AllocTRES=cpu=1,mem=2052077M > The running job (77) has allocated 1 CP…

[slurm-users] Re: Suggestions for Partition/QoS configuration

2024-04-04 Thread thomas.hartmann--- via slurm-users
Hi, I'm currently testing an approach similar to the example by Loris. Why consider preemption? Because, in the original example, if the cluster is saturated by long-running jobs (like 2 weeks), there should be the possibility to run short jobs right away. Best, Thomas
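A minimal sketch of one way to let short jobs preempt when the cluster is saturated with long jobs; the tier values and REQUEUE mode are assumptions about such a test setup, not what was actually configured:

    # slurm.conf: preempt on partition priority; preempted jobs are requeued
    # (jobs must be requeueable, which is the default with JobRequeue=1).
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    # Jobs in higher-PriorityTier partitions can preempt jobs from lower
    # tiers on shared nodes:
    PartitionName=short Nodes=node[01-99] MaxTime=04:00:00    PriorityTier=200 State=UP
    PartitionName=long  Nodes=node[01-99] MaxTime=14-00:00:00 PriorityTier=100 PreemptMode=REQUEUE State=UP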