Hi,

I'm still trying to work out how to express the core binding I want in Open MPI 2.x using the --map-by option. Can anyone help, please?

I bet I'm being dumb, but it's proving tricky to achieve the following aims (most important first):

1) Maximise memory bandwidth usage (e.g. load balance ranks across
   processor sockets)
2) Optimise for nearest-neighbour comms (in MPI_COMM_WORLD) (e.g. put
   neighbouring ranks on the same socket)
3) Have an incantation that's simple to change based on the number of
   ranks and threads per rank I want.
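
To make the target concrete, here is a minimal Python sketch of the placement that aims (1) and (2) seem to imply: ranks split into contiguous, near-equal blocks with one block per socket, and each rank given a fixed number of consecutive cores. The function names (desired_layout, render) are hypothetical, the defaults of 2 sockets and 12 cores per socket are taken from the example box, and the mask format only imitates what --report-bindings prints. It describes the layout being asked for, not anything Open MPI does.

```python
def desired_layout(n_ranks, pe, sockets=2, cores_per_socket=12):
    """Return one (socket, [global core ids]) entry per rank.

    Hypothetical helper: neighbouring ranks stay on the same socket
    (aim 2) while the ranks are spread evenly over the sockets (aim 1).
    """
    base, extra = divmod(n_ranks, sockets)
    layout = []
    for s in range(sockets):
        # The first `extra` sockets take one additional rank.
        n_here = base + (1 if s < extra else 0)
        if n_here * pe > cores_per_socket:
            raise ValueError("not enough cores on socket %d" % s)
        for i in range(n_here):
            first = s * cores_per_socket + i * pe
            layout.append((s, list(range(first, first + pe))))
    return layout


def render(layout, sockets=2, cores_per_socket=12):
    """Draw each rank's cores as a [B/./.][./././.] style mask."""
    lines = []
    for rank, (_, cores) in enumerate(layout):
        mask = ""
        for s in range(sockets):
            cells = [
                "B" if s * cores_per_socket + c in cores else "."
                for c in range(cores_per_socket)
            ]
            mask += "[" + "/".join(cells) + "]"
        lines.append("rank %d: %s" % (rank, mask))
    return lines


if __name__ == "__main__":
    # 6 ranks with 2 cores each: ranks 0-2 on socket 0, ranks 3-5 on socket 1.
    for line in render(desired_layout(6, 2)):
        print(line)
```

Changing only the two arguments (number of ranks, cores per rank) is the kind of simplicity aim (3) is after.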

Example:

Considering a 2 socket, 12 cores/socket box and a program with 2 threads per rank...

... this is great if I fully-populate the node:

$ mpirun -np 12 -map-by slot:PE=2 --bind-to core --report-bindings ./prog
[somehost:101235] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././././././.][./././././././././././.]
[somehost:101235] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././././././.][./././././././././././.]
[somehost:101235] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./././././.][./././././././././././.]
[somehost:101235] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B/./././.][./././././././././././.]
[somehost:101235] MCW rank 4 bound to socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]]: [././././././././B/B/./.][./././././././././././.]
[somehost:101235] MCW rank 5 bound to socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: [././././././././././B/B][./././././././././././.]
[somehost:101235] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././././././.][B/B/./././././././././.]
[somehost:101235] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././././././.][././B/B/./././././././.]
[somehost:101235] MCW rank 8 bound to socket 1[core 16[hwt 0]], socket 1[core 17[hwt 0]]: [./././././././././././.][././././B/B/./././././.]
[somehost:101235] MCW rank 9 bound to socket 1[core 18[hwt 0]], socket 1[core 19[hwt 0]]: [./././././././././././.][././././././B/B/./././.]
[somehost:101235] MCW rank 10 bound to socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: [./././././././././././.][././././././././B/B/./.]
[somehost:101235] MCW rank 11 bound to socket 1[core 22[hwt 0]], socket 1[core 23[hwt 0]]: [./././././././././././.][././././././././././B/B]


... but not if I don't [fails aim (1)]:

$ mpirun -np 6 -map-by slot:PE=2 --bind-to core --report-bindings ./prog
[somehost:102035] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././././././.][./././././././././././.]
[somehost:102035] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././././././.][./././././././././././.]
[somehost:102035] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./././././.][./././././././././././.]
[somehost:102035] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B/./././.][./././././././././././.]
[somehost:102035] MCW rank 4 bound to socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]]: [././././././././B/B/./.][./././././././././././.]
[somehost:102035] MCW rank 5 bound to socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: [././././././././././B/B][./././././././././././.]


... whereas if I map by socket instead of slot, I achieve aim (1) but fail on aim (2):

$ mpirun -np 6 -map-by socket:PE=2 --bind-to core --report-bindings ./prog
[somehost:105601] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././././././.][./././././././././././.]
[somehost:105601] MCW rank 1 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././././././.][B/B/./././././././././.]
[somehost:105601] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././././././.][./././././././././././.]
[somehost:105601] MCW rank 3 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././././././.][././B/B/./././././././.]
[somehost:105601] MCW rank 4 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./././././.][./././././././././././.]
[somehost:105601] MCW rank 5 bound to socket 1[core 16[hwt 0]], socket 1[core 17[hwt 0]]: [./././././././././././.][././././B/B/./././././.]
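
Comparing these runs by eye gets tedious, so here is a rough Python sketch that scores pasted --report-bindings output against aims (1) and (2). The helper names (ranks_per_socket, check) are hypothetical, and the scoring rules are assumptions made for illustration: "balanced" means the per-socket rank counts differ by at most one, and "crossings" counts consecutive MCW ranks that sit on different sockets (lower is better for nearest-neighbour comms). It assumes each rank is bound within a single socket, as in all the runs above.

```python
import re

# One match per rank: the rank number, then the run of [B/./.] socket masks.
# re.S lets the match span a line that the mail client wrapped.
PATTERN = re.compile(
    r"MCW rank (\d+) bound to .*?:\s*((?:\[[B./]+\])+)", re.S
)


def ranks_per_socket(report):
    """Map each rank to the index of the socket whose mask contains a 'B'."""
    placement = {}
    for rank, masks in PATTERN.findall(report):
        per_socket = re.findall(r"\[([B./]+)\]", masks)
        bound = [i for i, m in enumerate(per_socket) if "B" in m]
        # Assumption: a rank never spans sockets, so take the first hit.
        placement[int(rank)] = bound[0]
    return placement


def check(placement, n_sockets=2):
    """Return (balanced, crossings) for a rank -> socket mapping."""
    sockets = list(placement.values())
    counts = [sockets.count(s) for s in range(n_sockets)]
    balanced = max(counts) - min(counts) <= 1
    ranks = sorted(placement)
    crossings = sum(
        placement[a] != placement[b] for a, b in zip(ranks, ranks[1:])
    )
    return balanced, crossings


if __name__ == "__main__":
    # First two lines of the socket:PE=2 run above.
    sample = (
        "[somehost:105601] MCW rank 0 bound to socket 0[core 0[hwt 0]], "
        "socket 0[core 1[hwt 0]]: "
        "[B/B/./././././././././.][./././././././././././.]\n"
        "[somehost:105601] MCW rank 1 bound to socket 1[core 12[hwt 0]], "
        "socket 1[core 13[hwt 0]]: "
        "[./././././././././././.][B/B/./././././././././.]\n"
    )
    print(check(ranks_per_socket(sample)))
```

Under these rules the layout I'm after for 6 ranks would be balanced with a single crossing, between ranks 2 and 3.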


Any ideas, please?

Thanks,

Mark
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
