On 2013-08-06 17:18, Ryan Cox wrote:
We made the mistake of setting TaskAffinity=yes, though I'm not sure why we did that. There seems to be a bug where the first node has cgroup/cpuset and task affinity set correctly, but subsequent nodes set task affinity for *all* tasks to be CPU 0. We hadn't gotten around to reporting it yet but it's worth checking out.
I have experimented a bit with task affinity lately, and in my experience the above happens when an MPI job built with Open MPI is launched with mpirun rather than with "srun --mpi=openmpi". The reason is that mpirun itself calls srun to launch one "orted" daemon per remote node, and orted then forks the actual MPI ranks on that node. Slurm only sees the single orted task per node and binds it to the first allocated core, so when orted launches the MPI processes they inherit the parent's CPU affinity mask.
When launching with "srun --mpi=openmpi", Slurm knows how many tasks should be started on each node and the affinity is set correctly for each task.
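A quick way to see this is to have every task print its own affinity mask. Below is a minimal sketch, assuming Python 3 is available on the compute nodes; the file name affinity.py and the rank variables are just for illustration (OMPI_COMM_WORLD_RANK is set by Open MPI's mpirun, SLURM_PROCID by srun):

#!/usr/bin/env python3
# Print this task's CPU affinity so bindings can be compared between
# "mpirun python3 affinity.py" and "srun --mpi=openmpi python3 affinity.py".
import os
import socket

rank = os.environ.get("OMPI_COMM_WORLD_RANK",
                      os.environ.get("SLURM_PROCID", "?"))
mask = sorted(os.sched_getaffinity(0))   # CPUs this process may run on

print("host=%s rank=%s pid=%d cpus=%s"
      % (socket.gethostname(), rank, os.getpid(), mask))

With the mpirun launch described above, every rank on a remote node should report the same single CPU; with srun each task should get its own core(s).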
Another thing that subtly breaks when using mpirun is accounting. If you check a running job with "sstat -j <jobid>" or a completed job with "sacct -l -j <jobid>", the number of tasks is incorrect, again because Slurm thinks the tasks are the "orted" processes rather than the actual MPI processes.
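For what it's worth, here is a rough sketch of pulling the per-step task counts for a finished job and comparing them against the number of MPI ranks you expected; the job id is a placeholder, and JobID/NTasks/AllocCPUS are standard sacct format fields:

# Show sacct's per-step task counts for one job.
import subprocess

jobid = "123456"  # placeholder job id
out = subprocess.run(
    ["sacct", "-j", jobid, "--noheader", "--parsable2",
     "--format=JobID,NTasks,AllocCPUS"],
    check=True, capture_output=True, text=True,
).stdout

for line in out.strip().splitlines():
    step, ntasks, ncpus = line.split("|")
    print("step %-15s ntasks=%-4s alloccpus=%s" % (step, ntasks, ncpus))

When the job was started with mpirun, NTasks ends up reflecting the orted/proxy processes rather than the MPI ranks.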
The same issues appear when using MVAPICH2 and launching with mpirun, except that there you get "hydra_pmi_proxy" instead of "orted". However, in contrast to Open MPI, MVAPICH2 seems to override the affinity it inherits and can use all the cores allocated to the job (limited by the cpuset generated when ConstrainCores=yes is set in cgroup.conf). Accounting is still incorrect, though.
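To see what the job's cgroup actually allows versus what an individual task is bound to, something like the following can be run as a task of the job (a sketch assuming the cpuset controller is mounted at /sys/fs/cgroup/cpuset, which is the usual location; adjust the path if your nodes differ):

# Compare the cpuset created by Slurm's cgroup plugin with the process's
# scheduler affinity mask.
import os

cpuset = open("/proc/self/cpuset").read().strip()   # e.g. /slurm/uid_.../job_.../step_...
cpus = open("/sys/fs/cgroup/cpuset%s/cpuset.cpus" % cpuset).read().strip()

print("cgroup cpuset :", cpuset)
print("cpuset.cpus   :", cpus)                             # cores the job's cgroup allows
print("affinity mask :", sorted(os.sched_getaffinity(0)))  # cores the task is bound to

With ConstrainCores=yes the affinity mask can never be wider than cpuset.cpus, which is why MVAPICH2's own binding is still confined to the job's allocation.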
--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi