Various options that might help reduce job fragmentation.

        Turn up debugging on slurmctld and add DebugFlags such as TraceJobs,
SelectType, and Steps. With debugging set high enough one can see a good bit of
the logic behind node selection.
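        For example (a minimal sketch; the debug level shown is arbitrary),
these can be set at runtime with scontrol or persistently in slurm.conf:

              # Raise slurmctld logging at runtime
              scontrol setdebug debug2
              scontrol setdebugflags +TraceJobs
              scontrol setdebugflags +SelectType
              scontrol setdebugflags +Steps

              # Or the equivalent in slurm.conf
              SlurmctldDebug=debug2
              DebugFlags=TraceJobs,SelectType,Steps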
        
              CR_LLN Schedule resources to jobs on the least loaded nodes
                     (based upon the number of idle CPUs). This is generally
                     only recommended for an environment with serial jobs, as
                     idle resources will tend to be highly fragmented,
                     resulting in parallel jobs being distributed across many
                     nodes. Note that node Weight takes precedence over how
                     many idle resources are on each node. Also see the
                     partition configuration parameter LLN to use the least
                     loaded nodes in selected partitions.
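        If LLN behavior is wanted at all, it can be confined to a partition of
serial jobs rather than enabled cluster-wide; a sketch (partition and node
names below are hypothetical):

              # Cluster-wide, via the select plugin
              SelectType=select/cons_tres
              SelectTypeParameters=CR_Core_Memory,CR_LLN

              # Or only for one partition
              PartitionName=serial Nodes=node[001-016] LLN=YES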

        Explore node weights. If your nodes are not identical, apply node
weights to sort them in the order in which you want them to be selected. Even
for homogeneous nodes, you might try sets of weights so that, within a given
scheduling cycle, the scheduler considers the smaller group of nodes at one
weight before moving on to the group at the next weight. A weight group might
contain no fewer than 1/3 or 1/4 of the total partition size. YMMV based on,
for instance, the ratio of serial jobs to MPI jobs, job length, etc. I have
seen evidence that node allocation progresses roughly this way.
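        A minimal sketch (node names, ranges, and weight values are
hypothetical) splitting a homogeneous partition into three weight groups:

              NodeName=node[001-032] CPUs=128 Weight=10
              NodeName=node[033-064] CPUs=128 Weight=20
              NodeName=node[065-096] CPUs=128 Weight=30

Lower-weight nodes are preferred, so allocations tend to fill the first group
before spilling into the next.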

        Turn on backfill and educate users to specify both their job resource
requirements and their job runtime more accurately. This will allow backfill
to work more efficiently. Note that backfill choices are made within a given
set of jobs within a partition.
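        A sketch of the relevant pieces (the parameter values are only
examples):

              # slurm.conf: enable the backfill scheduler
              SchedulerType=sched/backfill
              SchedulerParameters=bf_window=2880,bf_continue

              # User side: a tight time limit and an exact resource request
              # help backfill slot the job in
              sbatch --ntasks=16 --time=02:00:00 job.sh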


              CR_Pack_Nodes
                     If a job allocation contains more resources than will be
                     used for launching tasks (e.g. if whole nodes are
                     allocated to a job), then rather than distributing a
                     job's tasks evenly across its allocated nodes, pack them
                     as tightly as possible on these nodes. For example,
                     consider a job allocation containing two entire nodes
                     with eight CPUs each. If the job starts ten tasks across
                     those two nodes without this option, it will start five
                     tasks on each of the two nodes. With this option, eight
                     tasks will be started on the first node and two tasks on
                     the second node. This can be superseded by "NoPack" in
                     srun's "--distribution" option. CR_Pack_Nodes only
                     applies when the "block" task distribution method is
                     used.

              pack_serial_at_end
                     If used with the select/cons_res or select/cons_tres
                     plugin, then put serial jobs at the end of the available
                     nodes rather than using a best fit algorithm. This may
                     reduce resource fragmentation for some workloads.
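        This is enabled by appending it to the SchedulerParameters list in
slurm.conf, for example:

              SchedulerParameters=bf_window=2880,bf_continue,pack_serial_at_end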

              reduce_completing_frag
                     This option is used to control how scheduling of
                     resources is performed when jobs are in the COMPLETING
                     state, which influences potential fragmentation. If this
                     option is not set then no jobs will be started in any
                     partition when any job is in the COMPLETING state for
                     less than CompleteWait seconds. If this option is set
                     then no jobs will be started in any individual partition
                     that has a job in COMPLETING state for less than
                     CompleteWait seconds. In addition, no jobs will be
                     started in any partition with nodes that overlap with
                     any nodes in the partition of the completing job. This
                     option is to be used in conjunction with CompleteWait.
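        As the excerpt notes, this works together with CompleteWait; a sketch
(the 32-second value is only an example):

              CompleteWait=32
              SchedulerParameters=reduce_completing_frag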

-----Original Message-----
From: Gerhard Strangar via slurm-users <slurm-users@lists.schedmd.com> 
Sent: Tuesday, April 9, 2024 12:53 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Avoiding fragmentation

Hi,

I'm trying to figure out how to deal with a mix of few- and many-cpu jobs. By 
that I mean most jobs use 128 cpus, but sometimes there are jobs with only 16. 
As soon as that job with only 16 is running, the scheduler splits the next 128 
cpu jobs into 96+16 each, instead of assigning a full 128 cpu node to them. Is 
there a way for the administrator to achieve preferring full nodes?
The existence of pack_serial_at_end makes me believe there is not, because that 
basically is what I needed, apart from my serial jobs using
16 cpus instead of 1.

Gerhard
