[slurm-users] Suggestions for Partition/QoS configuration

2024-04-04 Thread thomas.hartmann--- via slurm-users
Hi,
we're testing possible slurm configurations on a test system right now. 
Eventually, it is going to serve ~1000 users.

We're going to have some users who are going to run lots of short jobs (a 
couple of minutes to ~4h) and some users that run jobs that are going to run 
for days or weeks. I want to avoid a situation in which a group of users 
basically saturates the whole cluster with jobs that run for a week or two and 
nobody could run any short jobs anymore. I also would like to favor short jobs, 
because they make the whole cluster feel more dynamic and agile for everybody.

On the other hand, I would like to make the most of the resources, i.e. when 
nobody is sending short jobs, long jobs could run on all the nodes.

My idea was to basically have three partitions:

1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99]  
PriorityTier=100
2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50] 
PriorityTier=100
3. PartitionName=long_preempt MaxTime=14-00:00:00 State=UP Nodes=node[01-99] 
PriorityTier=40 PreemptMode=requeue

and then use the JobSubmitPlugin "all_partitions" so that all jobs get 
submitted to all partitions by default. This way, a short job ends up in the 
`short` partition and is able to use all nodes. A long job ends up in the 
`long_safe` partition on the first 50 nodes. These jobs are not going to 
be preempted. Remaining long jobs use the `long_preempt` partition, so they run on 
the remaining nodes as long as there are no higher-priority short (or long) jobs in 
the queue.
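
For reference, the global settings this idea relies on would look roughly like this 
in slurm.conf (just a sketch of what I am testing; the partitions are the three above):

    # partition-based preemption; preempted jobs get requeued
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    # submit each job to all partitions unless the user picks one explicitly
    JobSubmitPlugins=all_partitions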

So, the cluster could be saturated with long running jobs but if short jobs are 
submitted and the user has a high enough fair share, some of the long jobs 
would get preempted and the short ones would run.

This scenario works fine, BUT the long jobs seem to be playing ping-pong on 
the `long_preempt` partition: as soon as they run, they stop accruing 
AGE priority, unlike still-queued jobs. As soon as a queued job, even one by the 
same user, "overtakes" a running one, it preempts the running one, which then stops 
accruing age, and so on.

So, is there maybe a cleverer way to do this?

Thanks a lot!
Thomas

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: How to reinstall / reconfigure Slurm?

2024-04-04 Thread Shooktija S N via slurm-users
Thank you for the response, it certainly clears up a few things, and the
list of required packages is super helpful (where are these listed in the
docs?).

Here are a few follow up questions:

I had installed Slurm (version 22.05) using apt by running 'apt install
slurm-wlm'. Is it necessary to remove that installation first (with something like
'apt-get autoremove slurm-wlm') before compiling the Slurm source code from scratch,
as you've described?

You have given this command as an example:
rpmbuild --define="_with_nvml --with-nvml=/usr" --define="_with_pam
--with-pam=/usr" --define="_with_pmix --with-pmix=/usr"
--define="_with_hdf5 --without-hdf5" --define="_with_ofed --without-ofed"
--define="_with_http_parser --with-http-parser=/usr/lib64"
--define="_with_yaml  --define="_with_jwt  --define="_with_slurmrestd
--with-slurmrestd=1" -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-`date
+%F` 2>&1

Are the options you've used in this example command fairly standard options
for a 'general' installation of Slurm? Where can I learn more about these
options to make sure that I don't miss any important options that might be
necessary for the specs of my cluster?

Would I have to add the paths to the compiled binaries to the PATH or
LD_LIBRARY_PATH environment variables?

My nodes are running an OS based on Debian 12 (Proxmox VE); what is the
'rpmbuild' equivalent for my OS? Would the syntax used in your example
command be the same for any build tool?

Thanks!


On Wed, Apr 3, 2024 at 9:18 PM Williams, Jenny Avis 
wrote:

> Slurm source code should be downloaded and recompiled including the
> configuration flag --with-nvml.
>
>
>
>
>
> As an example, using rpmbuild mechanism for recompiling and generating
> rpms, this is our current method.  Be aware that the compile works only if
> it finds the prerequisites needed for a given option on the host. (* e.g.
> to recompile this --with-nvml you should do so on a functioning GPU host *)
>
>
>
> 
>
>
>
> export VERSION=23.11.5
>
>
>
>
>
> wget https://download.schedmd.com/slurm/slurm-$VERSION.tar.bz2
>
> #
>
> rpmbuild --define="_with_nvml --with-nvml=/usr" --define="_with_pam
> --with-pam=/usr" --define="_with_pmix --with-pmix=/usr"
> --define="_with_hdf5 --without-hdf5" --define="_with_ofed --without-ofed"
> --define="_with_http_parser --with-http-parser=/usr/lib64"
> --define="_with_yaml  --define="_with_jwt  --define="_with_slurmrestd
> --with-slurmrestd=1" -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-`date
> +%F` 2>&1
>
>
>
>
>
> This is a list of packages we ensure are installed on a given node when
> running this compile.
>
>
>
> - pkgs:
>
>   - bzip2
>
>   - cuda-nvml-devel-12-2
>
>   - dbus-devel
>
>   - freeipmi
>
>   - freeipmi-devel
>
>   - gcc
>
>   - gtk2-devel
>
>   - hwloc-devel
>
>   - libjwt-devel
>
>   - libssh2-devel
>
>   - libyaml-devel
>
>   - lua-devel
>
>   - make
>
>   - mariadb-devel
>
>   - munge-devel
>
>   - munge-libs
>
>   - ncurses-devel
>
>   - numactl-devel
>
>   - openssl-devel
>
>   - pam-devel
>
>   - perl
>
>   - perl-ExtUtils-MakeMaker
>
>   - readline-devel
>
>   - rpm-build
>
>   - rpmdevtools
>
>   - rrdtool-devel
>
>   - http-parser-devel
>
>   - json-c-devel
>
>
>
> *From:* Shooktija S N via slurm-users 
> *Sent:* Wednesday, April 3, 2024 7:01 AM
> *To:* slurm-users@lists.schedmd.com
> *Subject:* [slurm-users] How to reinstall / reconfigure Slurm?
>
>
>
> Hi,
>
>
>
> I am setting up Slurm on our lab's 3 node cluster and I have run into a
> problem while adding GPUs (each node has an NVIDIA 4070 ti) as a GRES.
> There is an error at the 'debug' log level in slurmd.log that says that the
> GPU is file-less and is being removed from the final GRES list. This error,
> according to some older posts on this forum, might be fixed by reinstalling /
> reconfiguring Slurm with the right flag (the '--with-nvml' flag according
> to this post).
>
>
>
> Line in /var/log/slurmd.log:
>
> [2024-04-03T15:42:02.695] debug:  Removing file-less GPU gpu:rtx4070 from
> final GRES list
>
>
>
> Does this error require me to either reinstall / reconfigure Slurm? What
> does 'reconfigure Slurm' mean?
>
> I'm about as clueless as a caveman with a smartphone when it comes to
> Slurm administration and Linux system administration in general. So, if you
> could, please explain it to me as simply as possible.
>
>
>
> slurm.conf without comment lines:
>
> ClusterName=DlabCluster
> SlurmctldHost=server1
> GresTypes=gpu
> ProctrackType=proctrack/linuxproc
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=root
> StateSaveLocation=/var/spool/slurmctld
> TaskPlugin=task/affinity,task/cgroup
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waitti

[slurm-users] Re: Suggestions for Partition/QoS configuration

2024-04-04 Thread Loris Bennett via slurm-users
Hi Thomas,

"thomas.hartmann--- via slurm-users" 
writes:

> Hi,
> we're testing possible slurm configurations on a test system right now. 
> Eventually, it is going to serve ~1000 users.
>
> We're going to have some users who are going to run lots of short jobs
> (a couple of minutes to ~4h) and some users that run jobs that are
> going to run for days or weeks. I want to avoid a situation in which a
> group of users basically saturates the whole cluster with jobs that
> run for a week or two and nobody could run any short jobs anymore. I
> also would like to favor short jobs, because they make the whole
> cluster feel more dynamic and agile for everybody.
>
> On the other hand, I would like to make the most of the resources,
> i.e. when nobody is sending short jobs, long jobs could run on all the
> nodes.
>
> My idea was to basically have three partitions:
>
> 1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99]  
> PriorityTier=100
> 2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50] 
> PriorityTier=100
> 3. PartitionName=long_preempt MaxTime=14-00:00:00 State=UP Nodes=node[01-99] 
> PriorityTier=40 PreemptMode=requeue
>
> and then use the JobSubmitPlugin "all_partitions" so that all jobs get
> submitted to all partitions by default. This way, a short job ends up
> in the `short` partition and is able to use all nodes. A long job ends
> up in the `long_safe` partition on the first 50 nodes. These
> jobs are not going to be preempted. Remaining long jobs use the
> `long_preempt` partition, so they run on the remaining nodes as long as
> there are no higher-priority short (or long) jobs in the queue.
>
> So, the cluster could be saturated with long running jobs but if short
> jobs are submitted and the user has a high enough fair share, some of
> the long jobs would get preempted and the short ones would run.
>
> This scenario works fine, BUT the long jobs seem to be playing
> ping-pong on the `long_preempt` partition: as soon as they run,
> they stop accruing AGE priority, unlike still-queued jobs. As soon as a
> queued job, even one by the same user, "overtakes" a running one, it
> preempts the running one, which then stops accruing age, and so on.
>
> So, is there maybe a cleverer way to do this?
>
> Thanks a lot!
> Thomas

I have never really understood the approach of having different
partitions for different lengths of job, but it seems to be quite
widespread, so I assume there are valid use cases.

However, for our roughly 450 users, of whom about 200 will submit at
least one job in a given month, we have an alternative approach without
pre-emption where we essentially have just a single partition.  Users
can then specify a QOS which will increase priority at the cost of
accepting a lower cap on number of jobs/resources/maximum runtime:

$ sqos
      Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU 
---------- ---------- ----------- ------- --------- -------------------- 
    hiprio        100     3:00:00      50       100   cpu=128,gres/gpu=4 
      prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8 
  standard          0 14-00:00:00    2000         1  cpu=768,gres/gpu=16 

where

  alias sqos='sacctmgr show qos 
format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20'
  /usr/bin/sacctmgr

The standard cap on the resources corresponds to about 1/7 of our cores.
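
In case it is useful, the QOSs can be set up roughly like this (a sketch using the
values from the table above; 'someuser' is just a placeholder for however your
associations are organised):

    sacctmgr add qos hiprio
    sacctmgr modify qos hiprio set Priority=100 MaxWall=03:00:00 \
        MaxJobsPerUser=50 MaxSubmitJobsPerUser=100 MaxTRESPerUser=cpu=128,gres/gpu=4
    sacctmgr add qos prio
    sacctmgr modify qos prio set Priority=1000 MaxWall=3-00:00:00 \
        MaxJobsPerUser=500 MaxSubmitJobsPerUser=1000 MaxTRESPerUser=cpu=256,gres/gpu=8
    # ... likewise for the standard QOS ...
    # make the QOSs available to a user and set the default
    sacctmgr modify user where name=someuser set qos+=hiprio,prio defaultqos=standard

Users then opt in per job, e.g. 'sbatch --qos=hiprio job.sh'.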

The downside is that very occasionally nodes may idle because a user has
reached his or her cap.  However, we usually have enough uncapped
users submitting jobs, so that in fact this happens only rarely, such as
sometimes at Christmas or New Year.

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: scrun: Failed to run the container due to GID mapping configuration

2024-04-04 Thread Markus Kötter via slurm-users

Hi,


On 04.04.24 04:46, Toshiki Sonoda (Fujitsu) via slurm-users wrote:
We set up scrun (slurm 23.11.5) integrated with rootless podman, 



I'd recommend looking into NVIDIA enroot instead.

https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf 





Kind regards
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm 23.11 - Unknown system variable 'wsrep_on'

2024-04-04 Thread Russell Jones via slurm-users
Thanks! I realized I made a mistake and had it still talking to an older
slurmdbd system.

On Wed, Apr 3, 2024 at 1:54 PM Timo Rothenpieler via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> On 02.04.2024 22:15, Russell Jones via slurm-users wrote:
> > Hi all,
> >
> > I am working on upgrading a Slurm cluster from 20 -> 23. I was
> > successfully able to upgrade to 22, however now that I am trying to go
> > from 22 to 23, starting slurmdbd results in the following error being
> > logged:
> >
> > error: mysql_query failed: 1193 Unknown system variable 'wsrep_on'
>
> I get that error in my log every startup, and it's benign.
> That variable only exists on a Galera Cluster, so seeing it on a simple
> mariadb instance is to be expected and benign.
>
> >
> > When trying to start slurmctld, I get:
> >
> > [2024-04-02T15:09:52.439] Couldn't find tres gres/gpumem in the
> > database, creating.
> > [2024-04-02T15:09:52.439] Couldn't find tres gres/gpuutil in the
> > database, creating.
> > [2024-04-02T15:09:52.440] fatal: Problem adding tres to the database,
> > can't continue until database is able to make new tres
> >
> >
> > Any ideas what could be causing these errors? Is MariaDB 5.5 still
> > officially supported?
>
> Check permissions of your database user, it has to be able to create and
> alter tables.
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] SLURM configuration help

2024-04-04 Thread Alison Peterson via slurm-users
I am writing to seek assistance with a critical issue on our single-node
system managed by Slurm. Our jobs are queued and marked as awaiting
resources, but they are not starting despite seeming availability. I'm new
to Slurm and my only experience was a class on installing it, so I have no
experience running or using it.

Issue Summary:

Main Problem: Of the jobs submitted, only one runs and the second shows NODELIST(REASON)
(Resources). I've checked that our single node has enough RAM (2TB) and
CPUs (64) available.

# COMPUTE NODES
NodeName=cusco CPUs=64 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1
RealMemory=2052077 Gres=gpu:1,gpu:1,gpu:1,gpu:1
PartitionName=mainpart Default=YES MinNodes=1 DefaultTime=00:60:00
MaxTime=UNLIMITED AllowAccounts=ALL Nodes=ALL State=UP OverSubscribe=Force


System Details: We have a single-node setup with Slurm as the workload
manager. The node appears to have sufficient resources for the queued jobs.

Troubleshooting Performed:
Configuration Checks: I have verified all Slurm configurations and the
system's resource availability, which should not be limiting job execution.
Service Status: The Slurm daemon slurmdbd is active and running without any
reported issues. System resource monitoring shows no shortages that would
prevent job initiation.

Any guidance and help will be deeply appreciated!

-- 
*Alison Peterson*
IT Research Support Analyst
*Information Technology*
apeters...@sdsu.edu 
O: 619-594-3364
*San Diego State University | SDSU.edu *
5500 Campanile Drive | San Diego, CA 92182-8080

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
What does “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID” 
show?

On one job we currently have that’s pending due to Resources, that job has 
requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but the 
node it wants to run on only has 37 CPUs available (seen by comparing its 
CfgTRES= and AllocTRES= values).

From: Alison Peterson via slurm-users 
Date: Thursday, April 4, 2024 at 10:43 AM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] SLURM configuration help



I am writing to seek assistance with a critical issue on our single-node system 
managed by Slurm. Our jobs are queued and marked as awaiting resources, but 
they are not starting despite seeming availability. I'm new to Slurm and my 
only experience was a class on installing it, so I have no experience running 
or using it.

Issue Summary:

Main Problem: Of the jobs submitted, only one runs and the second shows NODELIST(REASON) 
(Resources). I've checked that our single node has enough RAM (2TB) and CPUs 
(64) available.

# COMPUTE NODES
NodeName=cusco CPUs=64 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1 
RealMemory=2052077 Gres=gpu:1,gpu:1,gpu:1,gpu:1
PartitionName=mainpart Default=YES MinNodes=1 DefaultTime=00:60:00 
MaxTime=UNLIMITED AllowAccounts=ALL Nodes=ALL State=UP OverSubscribe=Force


System Details: We have a single-node setup with Slurm as the workload manager. 
The node appears to have sufficient resources for the queued jobs.
Troubleshooting Performed:
Configuration Checks: I have verified all Slurm configurations and the system's 
resource availability, which should not be limiting job execution.
Service Status: The Slurm daemon slurmdbd is active and running without any 
reported issues. System resource monitoring shows no shortages that would 
prevent job initiation.

Any guidance and help will be deeply appreciated!

--
Alison Peterson
IT Research Support Analyst
Information Technology
apeters...@sdsu.edu
O: 619-594-3364
San Diego State University | SDSU.edu
5500 Campanile Drive | San Diego, CA 92182-8080


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXT] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
Yep, from your scontrol show node output:

CfgTRES=cpu=64,mem=2052077M,billing=64
AllocTRES=cpu=1,mem=2052077M

The running job (77) has allocated 1 CPU and all the memory on the node. That’s 
probably due to the partition using the default DefMemPerCPU value [1], which 
is unlimited.

Since all our nodes are shared, and our workloads vary widely, we set our 
DefMemPerCPU value to something considerably lower than 
mem_in_node/cores_in_node. That way, most jobs will leave some memory 
available by default, and other jobs can use that extra memory as long as CPUs 
are available.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_DefMemPerCPU
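
As a rough sketch for a node like yours (64 cores, ~2052077 MB of RAM), the slurm.conf
side could look something like this (the exact value is a judgment call, not a
recommendation):

    # 2052077 MB / 64 cores is roughly 32000 MB per core; default to a bit less, so a
    # 1-CPU job no longer grabs the whole node's memory unless it asks for it
    DefMemPerCPU=30000
    # jobs that need more can still request it explicitly, e.g.
    #   sbatch --mem-per-cpu=64G ...   or   sbatch --mem=500G ...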

From: Alison Peterson 
Date: Thursday, April 4, 2024 at 11:58 AM
To: Renfro, Michael 
Subject: Re: [EXT] Re: [slurm-users] SLURM configuration help



Here is the info:
sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show node cusco

NodeName=cusco Arch=x86_64 CoresPerSocket=32
   CPUAlloc=1 CPUTot=64 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:4
   NodeAddr=cusco NodeHostName=cusco Version=19.05.5
   OS=Linux 5.4.0-172-generic #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024
   RealMemory=2052077 AllocMem=2052077 FreeMem=1995947 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=mainpart
   BootTime=2024-03-01T17:06:26 SlurmdStartTime=2024-03-01T17:06:53
   CfgTRES=cpu=64,mem=2052077M,billing=64
   AllocTRES=cpu=1,mem=2052077M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ squeue

 JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)
78  mainpart CF1090_w  sma PD   0:00  1 (Resources)
77  mainpart CF_w  sma  R   0:26  1 cusco
sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show job 78

JobId=78 JobName=CF1090_wOcean500m.shell
   UserId=sma(1008) GroupId=myfault(1001) MCS_label=N/A
   Priority=4294901720 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2024-04-04T09:55:34 EligibleTime=2024-04-04T09:55:34
   AccrueTime=2024-04-04T09:55:34
   StartTime=2024-04-04T10:55:28 EndTime=2024-04-04T11:55:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-04-04T09:55:58
   Partition=mainpart AllocNode:Sid=newcusco:2450574
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=cusco
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/work/sma-scratch/tohoku_wOcean/CF1090_wOcean500m.shell
   WorkDir=/data/work/sma-scratch/tohoku_wOcean
   StdErr=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
   StdIn=/dev/null
   StdOut=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
   Power=

On Thu, Apr 4, 2024 at 8:57 AM Renfro, Michael 
mailto:ren...@tntech.edu>> wrote:
What does “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID” 
show?

On one job we currently have that’s pending due to Resources, that job has 
requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but the 
node it wants to run on only has 37 CPUs available (seen by comparing its 
CfgTRES= and AllocTRES= values).

From: Alison Peterson via slurm-users 
mailto:slurm-users@lists.schedmd.com>>
Date: Thursday, April 4, 2024 at 10:43 AM
To: slurm-users@lists.schedmd.com 
mailto:slurm-users@lists.schedmd.com>>
Subject: [slurm-users] SLURM configuration help



I am writing to seek assistance with a critical issue on our single-node system 
managed by Slurm. Our jobs are queued and marked as awaiting resources, but 
they are not starting despite seeming availability. I'm new with SLURM and my 
only experience was a class on installing it so I have no experience, running 
it or using it.

Issue Summary:

Main Problem: Jobs submitted only one run and the second says NODELIST(REASON) 
(Resources). I've checked that our single node has enough RAM (2TB) and CPU's 
(64) available.

# COMPUTE NODES
NodeName=cusco CPUs=64 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1 
RealMemory=2052077 Gres=gpu:1,gpu:1,gpu:1,gpu:1
PartitionNa

[slurm-users] Re: Suggestions for Partition/QoS configuration

2024-04-04 Thread Jerome Verleyen via slurm-users

On 04/04/2024 at 03:33, Loris Bennett via slurm-users wrote:

I have never really understood the approach of having different
partitions for different lengths of job, but it seems to be quite
widespread, so I assume there are valid use cases.

However, for our around 450 users, of which about 200 will submit at
least one job in a given month, we have an alternative approach without
pre-emption where we essentially have just a single partition.  Users
can then specify a QOS which will increase priority at the cost of
accepting a lower cap on number of jobs/resources/maximum runtime:

$ sqos
      Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU
---------- ---------- ----------- ------- --------- --------------------
    hiprio        100     3:00:00      50       100   cpu=128,gres/gpu=4
      prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8
  standard          0 14-00:00:00    2000         1  cpu=768,gres/gpu=16

where

   alias sqos='sacctmgr show qos 
format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20'
   /usr/bin/sacctmgr

The standard cap on the resources corresponds to about 1/7 of our cores.

The downside is that very occasionally nodes may idle because a user has
reached his or her cap.  However, we usually have enough uncapped
users submitting jobs, so that in fact this happens only rarely, such as
sometimes at Christmas or New Year.

Cheers,

Loris


Hi Loris, Thomas,

I'm new to using the Slurm scheduler too.

In your configuration, you have to define a DefaultQOS for each user or
association, right? You don't define a DefaultQOS at the partition level.


Thanks!


--
-- Jérôme
Love is like soup: the first spoonfuls are too hot,
the last ones are too cold.
(Jeanne Moreau)

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Suggestions for Partition/QoS configuration

2024-04-04 Thread Gerhard Strangar via slurm-users
thomas.hartmann--- via slurm-users wrote:

> My idea was to basically have three partitions:
> 
> 1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99]  
> PriorityTier=100
> 2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50] 
> PriorityTier=100
> 3. PartitionName=long_preempt MaxTime=14-00:00:00 State=UP Nodes=nodes[01-99] 
> PriorityTier=40 PreemptMode=requeue

I don't know why you would consider preemption if you have short jobs; just
wait for jobs to finish.

My first approach would be to have two partitions, both of them
containing all nodes, but with different QOSs assigned to them, so you can
limit the short jobs to a certain number of CPUs and also limit long
jobs to a certain number of CPUs, maybe 80% for each of them.
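
Roughly something like this, as a sketch (I'm assuming 64-core nodes here, i.e.
about 6336 cores across node[01-99], so the ~80% caps below are just an
illustration; adjust to your hardware):

    # accounting side: one QOS per partition, each capping the total CPUs its jobs may use
    sacctmgr add qos shortq
    sacctmgr modify qos shortq set GrpTRES=cpu=5069
    sacctmgr add qos longq
    sacctmgr modify qos longq set GrpTRES=cpu=5069

    # slurm.conf side: both partitions span all nodes, each tied to its QOS
    PartitionName=short MaxTime=04:00:00    Nodes=node[01-99] QOS=shortq State=UP
    PartitionName=long  MaxTime=14-00:00:00 Nodes=node[01-99] QOS=longq  State=UP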

Gerhard

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXT] Re: [EXT] Re: SLURM configuration help

2024-04-04 Thread Alison Peterson via slurm-users
Thank you! That was the issue. I'm so happy :-) Sending you many thanks.

On Thu, Apr 4, 2024 at 10:11 AM Renfro, Michael  wrote:

> Yep, from your scontrol show node output:
>
> CfgTRES=cpu=64,mem=2052077M,billing=64
> AllocTRES=cpu=1,mem=2052077M
>
>
>
> The running job (77) has allocated 1 CPU and all the memory on the node.
> That’s probably due to the partition using the default DefMemPerCPU value
> [1], which is unlimited.
>
>
>
> Since all our nodes are shared, and our workloads vary widely, we set our
> DefMemPerCPU value to something considerably lower than
> mem_in_node/cores_in_node. That way, most jobs will leave some memory
> available by default, and other jobs can use that extra memory as long as
> CPUs are available.
>
>
>
> [1] https://slurm.schedmd.com/slurm.conf.html#OPT_DefMemPerCPU
>
>
>
> *From: *Alison Peterson 
> *Date: *Thursday, April 4, 2024 at 11:58 AM
> *To: *Renfro, Michael 
> *Subject: *Re: [EXT] Re: [slurm-users] SLURM configuration help
>
>
> Here is the info:
>
> *sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show node cusco*
>
>
> NodeName=cusco Arch=x86_64 CoresPerSocket=32
>CPUAlloc=1 CPUTot=64 CPULoad=0.02
>AvailableFeatures=(null)
>ActiveFeatures=(null)
>Gres=gpu:4
>NodeAddr=cusco NodeHostName=cusco Version=19.05.5
>OS=Linux 5.4.0-172-generic #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024
>RealMemory=2052077 AllocMem=2052077 FreeMem=1995947 Sockets=2 Boards=1
>State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>Partitions=mainpart
>BootTime=2024-03-01T17:06:26 SlurmdStartTime=2024-03-01T17:06:53
>CfgTRES=cpu=64,mem=2052077M,billing=64
>AllocTRES=cpu=1,mem=2052077M
>CapWatts=n/a
>CurrentWatts=0 AveWatts=0
>ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
>
> *sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ squeue*
>
>
>  JOBID PARTITION NAME USER ST   TIME  NODES
> NODELIST(REASON)
> 78  mainpart CF1090_w  sma PD   0:00  1
> (Resources)
> 77  mainpart CF_w  sma  R   0:26  1 cusco
>
> *sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show job 78*
>
>
> JobId=78 JobName=CF1090_wOcean500m.shell
>UserId=sma(1008) GroupId=myfault(1001) MCS_label=N/A
>Priority=4294901720 Nice=0 Account=(null) QOS=(null)
>JobState=PENDING Reason=Resources Dependency=(null)
>Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
>SubmitTime=2024-04-04T09:55:34 EligibleTime=2024-04-04T09:55:34
>AccrueTime=2024-04-04T09:55:34
>StartTime=2024-04-04T10:55:28 EndTime=2024-04-04T11:55:28 Deadline=N/A
>SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-04-04T09:55:58
>Partition=mainpart AllocNode:Sid=newcusco:2450574
>ReqNodeList=(null) ExcNodeList=(null)
>NodeList=(null) SchedNodeList=cusco
>NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>TRES=cpu=1,node=1,billing=1
>Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>Features=(null) DelayBoot=00:00:00
>OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
>Command=/data/work/sma-scratch/tohoku_wOcean/CF1090_wOcean500m.shell
>WorkDir=/data/work/sma-scratch/tohoku_wOcean
>StdErr=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
>StdIn=/dev/null
>StdOut=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
>Power=
>
>
>
> On Thu, Apr 4, 2024 at 8:57 AM Renfro, Michael  wrote:
>
> What does “scontrol show node cusco” and “scontrol show job
> PENDING_JOB_ID” show?
>
>
>
> On one job we currently have that’s pending due to Resources, that job has
> requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but
> the node it wants to run on only has 37 CPUs available (seen by comparing
> its CfgTRES= and AllocTRES= values).
>
>
>
> *From: *Alison Peterson via slurm-users 
> *Date: *Thursday, April 4, 2024 at 10:43 AM
> *To: *slurm-users@lists.schedmd.com 
> *Subject: *[slurm-users] SLURM configuration help
>
>
> I am writing to seek assistance with a critical issue on our single-node
> system managed by Slurm. Our jobs are queued and marked as awaiting
> resources, but they are not starting despite seeming availability. I'm new
> with SLURM and my only experience was a class on installing it so I have no
> experience, running it or using it.
>
> Issue Summary:
>
> Main Problem: Jobs submitted only one run and the second says 
> *NODELI

[slurm-users] Re: Suggestions for Partition/QoS configuration

2024-04-04 Thread thomas.hartmann--- via slurm-users
Hi,
I'm currently testing an approach similar to the example by Loris.

Why consider preemption? Because, in the original example, if the cluster is 
saturated by long-running jobs (like 2 weeks), there should still be the possibility 
to run short jobs right away.

Best,
Thomas

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com