Dear Slurm Users,

We've observed a strange issue with oversubscription: cores are being shared by multiple jobs.

We are using CR_CPU_Memory for resource selection, which, unlike CR_Memory, does not enforce oversubscription; a quick partition check confirms this:

$ scontrol show part | grep -o 'OverSubscribe=.*' | sort -u
OverSubscribe=NO
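
For reference, the resource selection is configured along these lines in slurm.conf (excerpt; SelectType shown here as select/cons_res, i.e. the consumable-resources plugin the CR_* parameters belong to):

...
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
...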

However, oversubscription does occur, as seen in this example where a single core is used by two jobs from two different users (user data anonymized):

/cgroup/cpuset/slurm/uid_123/job_10022564/cpus
8
/cgroup/cpuset/slurm/uid_456/job_10009002/cpus
8
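
Overlaps like this can be spotted node-wide with a quick sweep over the cgroup tree, e.g. the following one-liner, which prints each job's CPU list next to its path so that duplicates end up adjacent (depending on the mount options the file may be called cpuset.cpus rather than cpus):

$ grep . /cgroup/cpuset/slurm/uid_*/job_*/cpus | sort -t: -k2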

As a consequence, each job can only use up to 50% of the core, which hurts performance ('top' output):
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+   P COMMAND
1913 userx  20   0  125m  31m 4100 R 49.9  0.1 725:50.53  8 AppX
15480 usery  20   0  815m 163m  17m R 49.9  0.7  40:51.05  8 AppY
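
Both processes report the same last-used core in the 'P' column; their actual CPU affinity can be cross-checked with taskset, which prints the affinity list of a running PID:

$ taskset -cp 1913
$ taskset -cp 15480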

When checking the jobs with squeue, the 'OVER_SUBSCRIBE' attribute says 'OK', which according to the manual should mean a dedicated allocation:

$ squeue -j 10022564,10009002 -O jobid,oversubscribe
JOBID               OVER_SUBSCRIBE
10009002            OK
10022564            OK
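
The scheduler's own view can be compared against the cgroups via the detailed job output, which lists the CPU IDs allocated on each node, e.g.:

$ scontrol show job -d 10022564 | grep CPU_IDs
$ scontrol show job -d 10009002 | grep CPU_IDs

If the CPU_IDs reported there already overlap, the scheduler itself hands out the same core twice; if they don't, the suspect would be the mapping from the allocation to the cpuset cgroup.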

Any ideas why the cores are shared rather than dedicated to each job?
We are using cgroup plugins where applicable:

...
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup
...

There is no preemption configured, and cgroup.conf looks like this:

CgroupAutomount=yes
CgroupMountpoint=/cgroup
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainKmemSpace=yes
AllowedSwapSpace=0

Kind regards,
Lech



--
Lech Nieroda
Zentrum für Angewandte Informatik (ZAIK/RRZK)
Universität zu Köln
Robert-Koch-Str. 10
Gebäude 55 (RRZK-R2), Raum 210 (3. Etage)
D-50931 Köln
Deutschland

Tel.: +49 (221) 478-7021
Fax: +49 (221) 478-5568
E-Mail: nieroda.l...@uni-koeln.de

