Hi Chris,

Thank you for your reply regarding OpenMPI and srun. When I try to run an MPI 
program using srun I see the following...


red[036-037]
[red036.cluster.local:308110] PMI_Init [pmix_s1.c:168:s1_init]: PMI is not 
initialized
[red036.cluster.local:308107] PMI_Init [pmix_s1.c:168:s1_init]: PMI is not 
initialized
[red036.cluster.local:308111] PMI_Init [pmix_s1.c:168:s1_init]: PMI is not 
initialized
[red036.cluster.local:308101] PMI_Init [pmix_s1.c:168:s1_init]: PMI is not 
initialized
[red036.cluster.local:308105] PMI_Init [pmix_s1.c:168:s1_init]: PMI is not 
initialized


That's despite configuring OpenMPI with PMI support. That is...


./configure --prefix=/local/software/openmpi/3.0.0/intel --enable-shared 
--enable-static --enable-mpi-cxx --with-verbs --disable-java --disable-mpi-java 
--disable-mpi-thread-multiple --without-ucx --with-pmi=/usr 
--with-pmi-libdir=/usr --with-slurm --with-hwloc=internal
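
Presumably the next thing for me to try is to ask Slurm which PMI plugins srun 
actually offers and to request one explicitly, something along these lines 
(./hello_mpi is just a placeholder binary):

srun --mpi=list                        # list the MPI/PMI plugin types this Slurm build supports
srun --mpi=pmi2 -N 2 -n 8 ./hello_mpi  # request the PMI2 interface explicitly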


On the other hand, I find that mpirun detects the resources allocated to the job 
and the executable appears to run as expected. We did have a contract with 
SchedMD (now expired). I note that, following their advice, we were able to 
upgrade Slurm from 17.x to 18.x without having to rebuild our OpenMPI 
installations for the new Slurm version. Incidentally, the above failure occurs 
under both Slurm 17.x and 18.x.


Best regards,

David

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
Christopher Benjamin Coffey <chris.cof...@nau.edu>
Sent: 14 January 2019 18:11:06
To: Slurm User Community List
Subject: Re: [slurm-users] Larger jobs tend to get starved out on our cluster

Hi David,

You are welcome. I'm surprised that srun does not work for you. We advise our 
users to use srun for every type of job, not just MPI. This, in our opinion, 
keeps it simple, and it just works. What is your MpiDefault set to in slurm.conf? 
Is your OpenMPI built with Slurm support? I believe that's the default, so it 
should be. As you probably know, MPI implementations have to be recompiled when 
Slurm is upgraded between major versions. FWIW, this is how we have OpenMPI 
configured on our cluster:

./configure --prefix=/packages/openmpi/3.1.3-gcc-6.2.0 
--with-ucx=/packages/ucx/1.4.0 --with-slurm --with-verbs --with-lustre 
--with-pmi
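
If it helps, MpiDefault can be checked on a running system and, if need be, set 
in slurm.conf; a rough sketch (pmi2 here is only an example value, your site may 
use none or pmix):

scontrol show config | grep -i mpidefault   # what the running slurmctld is using
MpiDefault=pmi2                             # slurm.conf line, example value only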

What happens when srun "doesn't work"?

I'm unaware of a way to suppress CR_Pack_Nodes in jobs.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167


On 1/11/19, 10:07 AM, "slurm-users on behalf of Baker D.J." 
<slurm-users-boun...@lists.schedmd.com on behalf of d.j.ba...@soton.ac.uk> 
wrote:

    Hi Chris,


    Thank you for your comments. Yesterday I experimented with increasing
    PriorityWeightJobSize, and that does appear to have quite a profound effect
    on the job mix executing at any one time. Larger jobs (needing 5 nodes or
    more) are now getting a decent share of the nodes in the cluster. I've been
    running test jobs in between other bits of work and things are looking much
    better. I expected the change might be a little too aggressive, but the job
    mix is now very good overall.
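
    For the record, the change was simply a bump to the job-size weight in our
    slurm.conf, along these lines (the figure below is only illustrative, not
    the value we settled on):

    PriorityWeightJobSize=500000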


    Thank you for your suggested changes to the slurm.conf...


    SelectTypeParameters=CR_Pack_Nodes
    SchedulerParameters=pack_serial_at_end,bf_busy_nodes


    I especially like the idea of using "CR_Pack_Nodes", since the same node
    packing policy is in operation on our Moab cluster. On the other hand, we
    advise launching OpenMPI jobs using mpirun (it does work, and it does detect
    the resources requested in the job). In fact, despite installing OpenMPI
    with the PMI device, srun does not work for some reason! If you use mpirun,
    do you know if there is a valid way for users to suppress CR_Pack_Nodes in
    their jobs?
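
    For reference, a typical job script here looks roughly like the sketch below
    (the node and task counts, the module name and the binary are all
    placeholders; mpirun picks the node list and task count up from the Slurm
    allocation itself):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=40
    module load openmpi/3.0.0/intel
    mpirun ./my_mpi_app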


    Best regards,
    David

    ________________________________________
    From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
Skouson, Gary <gb...@psu.edu>
    Sent: 11 January 2019 16:53
    To: Slurm User Community List
    Subject: Re: [slurm-users] Larger jobs tend to get starved out on our 
cluster

    You should be able to turn on some backfill debug info from slurmctld; you
    can have Slurm output the backfill information. Take a look at the
    DebugFlags settings Backfill and BackfillMap.
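
    Something like the following should do it, either in slurm.conf or toggled
    at run time with scontrol:

    DebugFlags=Backfill,BackfillMap      # slurm.conf
    scontrol setdebugflags +Backfill     # or add on the fly
    scontrol setdebugflags +BackfillMap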

    Your bf_window is set to 3600 minutes, i.e. 2.5 days; if the start time of
    the large job is further out than that, it won't get any nodes reserved.

    You may also want to take a look at the bf_window_linear parameter. By
    default the backfill window search starts at 30 seconds and doubles at each
    iteration. Thus jobs that will need to wait a couple of days to gather the
    required resources will have a resolution in the backfill reservation that's
    more than a day wide. Even if nodes will be available 2 days from now, the
    "reservation" may be out 3 days, allowing 2-day jobs to sneak in before the
    large job. The result is that small jobs that last 1-2 days can delay the
    start of a large job for weeks.

    You can turn on bf_window_linear and it'll keep that from happening.
    Unfortunately, it means that more backfill iterations are required to search
    multiple days into the future. If you have relatively few jobs, that may not
    matter. If you have lots of jobs, it'll slow things down a bit. You'll have
    to do some testing to see if that'll work for you.
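
    As a sketch, it just gets appended to the SchedulerParameters you already
    have, e.g. (the 30 is only an illustrative increment; check the slurm.conf
    man page for the exact units and default in your Slurm version):

    SchedulerParameters=bf_window=3600,bf_resolution=180,bf_max_job_user=4,bf_window_linear=30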

    -----
    Gary Skouson


    From: slurm-users <slurm-users-boun...@lists.schedmd.com>
    On Behalf Of Baker D.J.
    Sent: Wednesday, January 9, 2019 11:40 AM
    To: slurm-users@lists.schedmd.com
    Subject: [slurm-users] Larger jobs tend to get starved out on our cluster



    Hello,



    A colleague intimated that he thought that larger jobs were tending to get
    starved out on our Slurm cluster. It's not a busy time at the moment, so
    it's difficult to test this properly. Back in November it was not completely
    unusual for a larger job to have to wait up to a week to start.



    I've extracted the key scheduling configuration out of the slurm.conf and I
    would appreciate your comments, please. Even at the busiest of times we
    notice many single compute jobs executing on the cluster -- starting either
    via the scheduler or by backfill.



    Looking at the scheduling configuration, do you think that I'm favouring
    small jobs too much? That is, for example, should I increase the
    PriorityWeightJobSize to encourage larger jobs to run?



    I was very keen not to starve out small/medium jobs; however, perhaps there
    is too much emphasis on small/medium jobs in our setup.



    My colleague is from a Moab background, and in that respect he was surprised
    not to see nodes being reserved for jobs. It could be, though, that Slurm
    works in a different way, trying to make efficient use of the cluster by
    backfilling more aggressively than Moab. Certainly we see a great deal of
    activity from backfill.



    In this respect, does anyone understand the mechanism used to reserve
    nodes/resources for jobs in Slurm, or know where to look for that type of
    information?



    Best regards,

    David



    SchedulerType=sched/backfill

    SchedulerParameters=bf_window=3600,bf_resolution=180,bf_max_job_user=4



    SelectType=select/cons_res

    SelectTypeParameters=CR_Core

    FastSchedule=1

    PriorityFavorSmall=NO

    PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME,FAIR_TREE

    PriorityType=priority/multifactor

    PriorityDecayHalfLife=14-0



    PriorityWeightFairshare=1000000

    PriorityWeightAge=100000

    PriorityWeightPartition=0

    PriorityWeightJobSize=100000

    PriorityWeightQOS=10000

    PriorityMaxAge=7-0








