On 2013-08-06 17:18, Ryan Cox wrote:
We made the mistake of setting TaskAffinity=yes, though I'm not sure why we did that. There seems to be a bug where the first node has cgroup/cpuset and task affinity set correctly, but subsequent nodes set task affinity for *all* tasks to be CPU 0. We hadn't gotten around to reporting it yet but it's worth checking out.
I have experimented a bit with task affinity lately, and in my experience the above happens when an MPI job built with Open MPI is launched with mpirun rather than with "srun --mpi=openmpi". The reason is that mpirun itself calls srun to launch one "orted" daemon per remote node, and orted then forks the actual MPI ranks on that node. Slurm only sees the single orted task per node and binds it to the first allocated core, so when orted launches the MPI processes they inherit the parent's CPU affinity mask.
When launching with "srun --mpi=openmpi", Slurm knows how many tasks should be started on each node and the affinity is set correctly for each task.
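A quick way to see this is to have every task print its own affinity mask. Below is a minimal sketch, assuming Python 3 is available on the compute nodes; the file name affinity.py and the rank variables are just for illustration (OMPI_COMM_WORLD_RANK is set by Open MPI's mpirun, SLURM_PROCID by srun):

#!/usr/bin/env python3
# Print this task's CPU affinity so bindings can be compared between
# "mpirun python3 affinity.py" and "srun --mpi=openmpi python3 affinity.py".
import os
import socket

rank = os.environ.get("OMPI_COMM_WORLD_RANK",
                      os.environ.get("SLURM_PROCID", "?"))
mask = sorted(os.sched_getaffinity(0))   # CPUs this process may run on

print("host=%s rank=%s pid=%d cpus=%s"
      % (socket.gethostname(), rank, os.getpid(), mask))

With the mpirun launch described above, every rank on a remote node should report the same single CPU; with srun each task should get its own core(s).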
Another thing that subtly breaks when using mpirun is accounting. If you check a running job with "sstat -j <jobid>" or a completed job with "sacct -l -j <jobid>", the number of tasks is incorrect, again because Slurm thinks the tasks are the "orted" processes rather than the actual MPI processes.
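For what it's worth, here is a rough sketch of pulling the per-step task counts for a finished job and comparing them against the number of MPI ranks you expected; the job id is a placeholder, and JobID/NTasks/AllocCPUS are standard sacct format fields:

# Show sacct's per-step task counts for one job.
import subprocess

jobid = "123456"  # placeholder job id
out = subprocess.run(
    ["sacct", "-j", jobid, "--noheader", "--parsable2",
     "--format=JobID,NTasks,AllocCPUS"],
    check=True, capture_output=True, text=True,
).stdout

for line in out.strip().splitlines():
    step, ntasks, ncpus = line.split("|")
    print("step %-15s ntasks=%-4s alloccpus=%s" % (step, ntasks, ncpus))

When the job was started with mpirun, NTasks ends up reflecting the orted/proxy processes rather than the MPI ranks.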
The same issues appear when using MVAPICH2 and launching with mpirun, except that there you get "hydra_pmi_proxy" instead of "orted". However, in contrast to Open MPI, MVAPICH2 seems to override the affinity it inherits and can use all the cores allocated to the job (limited by the cpuset generated when ConstrainCores=yes is set in cgroup.conf). Accounting is still incorrect, though.
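To see what the job's cgroup actually allows versus what an individual task is bound to, something like the following can be run as a task of the job (a sketch assuming the cpuset controller is mounted at /sys/fs/cgroup/cpuset, which is the usual location; adjust the path if your nodes differ):

# Compare the cpuset created by Slurm's cgroup plugin with the process's
# scheduler affinity mask.
import os

cpuset = open("/proc/self/cpuset").read().strip()   # e.g. /slurm/uid_.../job_.../step_...
cpus = open("/sys/fs/cgroup/cpuset%s/cpuset.cpus" % cpuset).read().strip()

print("cgroup cpuset :", cpuset)
print("cpuset.cpus   :", cpus)                             # cores the job's cgroup allows
print("affinity mask :", sorted(os.sched_getaffinity(0)))  # cores the task is bound to

With ConstrainCores=yes the affinity mask can never be wider than cpuset.cpus, which is why MVAPICH2's own binding is still confined to the job's allocation.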
--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi