Hi Gerhard,

I am not sure if this counts as an administrative measure, but we do
strongly encourage our users to always specify --nodes=n explicitly
together with --ntasks-per-node=m (rather than just --ntasks=n*m and
omitting the --nodes option, which may leave the cores scattered
across nodes wherever the network topology allows).
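
For illustration, a 256-task MPI job on 128-core nodes would then be
submitted with a batch script roughly like this (just a sketch; the
partition name and the application are made up):

    #!/bin/bash
    #SBATCH --nodes=2                # request two whole nodes explicitly
    #SBATCH --ntasks-per-node=128    # and fill each of them completely
    #SBATCH --partition=standard     # hypothetical partition name

    srun ./my_mpi_app

rather than only "#SBATCH --ntasks=256", which leaves the placement of
the tasks entirely up to the scheduler.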

I do understand Loris' and Tim's arguments, but for certain reasons we
have configured a single-user node access policy (ExclusiveUser=YES),
which allows multiple jobs to share a node, but only jobs owned by the
same user. So we also try to avoid fragmentation whenever possible and
want users to pack their jobs as densely as possible onto the nodes, in
order to leave as many nodes as possible available for others. For us
this works reasonably well in terms of core utilization, because almost
none of our users submit only one or two few-core jobs at a time; they
usually submit whole batches of such jobs (sometimes hundreds) at once,
so that multiple jobs run simultaneously on the individual nodes. That
keeps the waste of unallocated cores on individual nodes within
acceptable limits for us.
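
For reference, the relevant part of our slurm.conf looks roughly like
this (node names, core counts, memory sizes, limits and the partition
name are placeholders, not our actual configuration):

    # slurm.conf (excerpt)
    NodeName=node[001-100] CPUs=128 RealMemory=256000 State=UNKNOWN
    PartitionName=batch Nodes=node[001-100] Default=YES ExclusiveUser=YES MaxTime=48:00:00 State=UP

With ExclusiveUser=YES on the partition, a node is never shared between
different users, but jobs of the same user can still be packed onto it.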

Best regards
Jürgen


* Loris Bennett via slurm-users <slurm-users@lists.schedmd.com> [240409 07:51]:
> Hi Gerhard,
> 
> Gerhard Strangar via slurm-users <slurm-users@lists.schedmd.com> writes:
> 
> > Hi,
> >
> > I'm trying to figure out how to deal with a mix of few- and many-CPU
> > jobs. By that I mean most jobs use 128 CPUs, but sometimes there are
> > jobs with only 16. As soon as such a 16-CPU job is running, the
> > scheduler splits the next 128-CPU jobs into 96+16 each, instead of
> > assigning a full 128-CPU node to them. Is there a way for the
> > administrator to make the scheduler prefer full nodes?
> > The existence of pack_serial_at_end makes me believe there is not,
> > because that is basically what I need, except that my serial jobs use
> > 16 CPUs instead of 1.
> >
> > Gerhard
> 
> This may well not be relevant for your case, but we actively discourage
> the use of full nodes for the following reasons:
> 
>   - When the cluster is full, which is most of the time, MPI jobs in
>     general will start much faster if they don't specify the number of
>     nodes and certainly don't request full nodes.  The overhead due to
>     the jobs being scattered across nodes is often much lower than the
>     additional waiting time incurred by requesting whole nodes.
> 
>   - When all the cores of a node are requested, all the memory of the
>     node becomes unavailable to other jobs, regardless of how much
>     memory is requested or indeed how much is actually used.  This holds
>     up jobs with low CPU but high memory requirements and thus reduces
>     the total throughput of the system.
> 
> These factors are important for us because we have a large number of
> single core jobs and almost all the users, whether doing MPI or not,
> significantly overestimate the memory requirements of their jobs.
> 
> Cheers,
> 
> Loris
> 
> -- 
> Dr. Loris Bennett (Herr/Mr)
> FUB-IT (ex-ZEDAT), Freie Universität Berlin
> 

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Phone: +49 (0)731 50-22478
Fax: +49 (0)731 50-22471
