Hi Dave,

On Fri, Oct 27, 2017 at 3:57 PM, Dave Sizer <dsi...@nvidia.com> wrote:
> Kilian, when you specify your CPU bindings in gres.conf, are you using the
> same IDs that show up in nvidia-smi?
Yes:

$ srun -p gpu -c 4 --gres gpu:1 --pty bash
sh-114-01 $ cat /etc/slurm/gres.conf
name=gpu File=/dev/nvidia[0-1] CPUs=0,2,4,6,8,10,12,14,16,18
name=gpu File=/dev/nvidia[2-3] CPUs=1,3,5,7,9,11,13,15,17,19
sh-114-01 $ nvidia-smi topo -m
        GPU0    mlx5_0  CPU Affinity
GPU0     X      PHB     0-0,4-4,8-8,12-12
mlx5_0  PHB      X

> We noticed that our CPU IDs were being remapped from their nvidia-smi values
> by SLURM according to hwloc, so to get affinity working we needed to use
> these remapped values.

I don't think there's any remapping happening. Both Slurm (through
hwloc) and nvidia-smi get the CPU IDs from the kernel, which takes them
from the DMI pages and the BIOS. So they should all match, as they're
all coming from the same source.

Could you please elaborate on what makes you think the CPU IDs are
remapped somehow?

> I'm wondering if --accel-bind=g is not using these same remappings, because
> when our jobs hang with the option, slurmd.log reports "fatal: Invalid gres
> data for gpu, CPUs=16-31".
> But when we omit the option, we get no such error and everything seems to
> work fine, including GPU affinity.

We don't see such a hang, nor any similar error in slurmd.log, with or
without --accel-bind=g.

Do you have hyperthreading enabled by any chance? Are you positive you
have all 32 CPUs available on that node?

Cheers,
--
Kilian
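
PS: in case it helps, a couple of quick checks for the two questions
above (just a sketch; <nodename> is a placeholder and the exact output
will differ on your hardware):

# What slurmd detects on the node (CPUs, sockets, cores, threads per core):
$ slurmd -C

# What the controller currently has configured for that node:
$ scontrol show node <nodename> | grep -i cpu

# Whether hyperthreading is enabled and how the kernel numbers the CPUs,
# to compare against the "CPU Affinity" column of nvidia-smi topo -m:
$ lscpu | grep -E '^CPU\(s\)|Thread\(s\) per core|NUMA node'
$ nvidia-smi topo -m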