Hi Ralph,

I ran an overall verification of rr_mapper and found another problem
with "map-by node". As far as I checked, "map-by obj" with any object
other than node worked fine. I do not use "map-by node" myself, but I'd
like to report it to help improve the reliability of 1.7.5. It seems too
difficult for me to fix, so I hope you can take a look.

The problem occurs when I mix two kinds of nodes, even though I add
"-hetero-nodes" to the command line:

[mishima@manage work]$ cat pbs_hosts
node04 slots=8
node05 slots=2
node06 slots=2

[mishima@manage work]$ mpirun -np 12 -machinefile pbs_hosts -map-by node -report-bindings -hetero-nodes /home/mishima/mis/openmpi/demos/myprog
[manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file rmaps_rr.c at line 241
[manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file base/rmaps_base_map_job.c at line 285
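
For reference, the hostfile above provides 8 + 2 + 2 = 12 slots, so 12
procs should fit without oversubscription. The following is a minimal
sketch of the placement one might expect, assuming that by-node mapping
hands out ranks round-robin across the nodes and skips a node once its
slots are used up. The function name map_by_node is hypothetical; this
is not the rmaps_rr.c code.

```python
def map_by_node(hosts, nprocs):
    """Place ranks round-robin over nodes, skipping nodes that are full.

    hosts: list of (node_name, slots) pairs, in hostfile order.
    Assumption: no oversubscription is allowed.
    """
    total_slots = sum(slots for _, slots in hosts)
    if nprocs > total_slots:
        raise ValueError("not enough slots for %d procs" % nprocs)

    placement = {name: [] for name, _ in hosts}
    rank = 0
    while rank < nprocs:
        # One pass over the nodes; each node with a free slot takes one rank.
        for name, slots in hosts:
            if rank == nprocs:
                break
            if len(placement[name]) < slots:
                placement[name].append(rank)
                rank += 1
    return placement


# Slot counts taken from pbs_hosts above.
hosts = [("node04", 8), ("node05", 2), ("node06", 2)]
for name, ranks in map_by_node(hosts, 12).items():
    print(name, ranks)
```

Under that assumption, node05 and node06 each receive two ranks and
node04 receives the remaining eight.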

With "-np 11" it runs, but rank 10 is bound to the wrong core: socket 0,
core 0 on node04, which is already used by rank 0. I guess something is
wrong with the handling of different topologies when "map-by node" is
specified. In addition, the number of procs assigned to each node looks
wrong. In the output below, node04 (8 slots) gets 5 ranks, while node05
(2 slots) gets 4:

[mishima@manage work]$ mpirun -np 11 -machinefile pbs_hosts -map-by node -report-bindings -hetero-nodes /home/mishima/mis/openmpi/demos/myprog
[node04.cluster:13384] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.][./././././././.][./././././././.]
[node04.cluster:13384] MCW rank 6 bound to socket 0[core 2[hwt 0]]: [././B/././././.][./././././././.][./././././././.][./././././././.]
[node04.cluster:13384] MCW rank 8 bound to socket 0[core 3[hwt 0]]: [./././B/./././.][./././././././.][./././././././.][./././././././.]
[node04.cluster:13384] MCW rank 10 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.][./././././././.][./././././././.]
[node04.cluster:13384] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.][./././././././.][./././././././.]
[node06.cluster:24192] MCW rank 5 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
[node06.cluster:24192] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
[node05.cluster:25655] MCW rank 9 bound to socket 0[core 3[hwt 0]]: [./././B][./././.]
[node05.cluster:25655] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
[node05.cluster:25655] MCW rank 4 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
[node05.cluster:25655] MCW rank 7 bound to socket 0[core 2[hwt 0]]: [././B/.][./././.]
Hello world from process 4 of 11
Hello world from process 7 of 11
Hello world from process 6 of 11
Hello world from process 3 of 11
Hello world from process 0 of 11
Hello world from process 8 of 11
Hello world from process 2 of 11
Hello world from process 5 of 11
Hello world from process 9 of 11
Hello world from process 1 of 11
Hello world from process 10 of 11
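
The two symptoms in this output (a core shared by two ranks, and the
uneven per-node counts) can be checked mechanically. The following is a
rough sketch, assuming the "MCW rank N bound to socket S[core C[hwt ..."
line format shown above; the helper name summarize and the regular
expression are illustrative only and not part of Open MPI.

```python
import re
from collections import defaultdict

# Matches the prefix of a -report-bindings line as it appears above.
BINDING = re.compile(
    r"\[(?P<node>[^.:\]]+)[^\]]*\] MCW rank (?P<rank>\d+) "
    r"bound to socket (?P<socket>\d+)\[core (?P<core>\d+)\[hwt"
)

def summarize(log):
    """Return (ranks per node, cores that are bound by more than one rank)."""
    by_core = defaultdict(list)
    for m in BINDING.finditer(log):
        key = (m["node"], int(m["socket"]), int(m["core"]))
        by_core[key].append(int(m["rank"]))

    per_node = defaultdict(int)
    for (node, _, _), ranks in by_core.items():
        per_node[node] += len(ranks)

    shared = {k: sorted(v) for k, v in by_core.items() if len(v) > 1}
    return dict(per_node), shared


# Line prefixes copied from the output above (binding masks omitted).
LOG = """\
[node04.cluster:13384] MCW rank 3 bound to socket 0[core 1[hwt 0]]:
[node04.cluster:13384] MCW rank 6 bound to socket 0[core 2[hwt 0]]:
[node04.cluster:13384] MCW rank 8 bound to socket 0[core 3[hwt 0]]:
[node04.cluster:13384] MCW rank 10 bound to socket 0[core 0[hwt 0]]:
[node04.cluster:13384] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[node06.cluster:24192] MCW rank 5 bound to socket 0[core 1[hwt 0]]:
[node06.cluster:24192] MCW rank 2 bound to socket 0[core 0[hwt 0]]:
[node05.cluster:25655] MCW rank 9 bound to socket 0[core 3[hwt 0]]:
[node05.cluster:25655] MCW rank 1 bound to socket 0[core 0[hwt 0]]:
[node05.cluster:25655] MCW rank 4 bound to socket 0[core 1[hwt 0]]:
[node05.cluster:25655] MCW rank 7 bound to socket 0[core 2[hwt 0]]:
"""

per_node, shared = summarize(LOG)
print("ranks per node:", per_node)
print("cores bound by several ranks:", shared)
```

For this log it reports 5 ranks on node04, 4 on node05 and 2 on node06,
and flags socket 0, core 0 of node04 as bound by ranks 0 and 10.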

Regards,
Tetsuya Mishima
