Hi Ralph, I've found the fix. Please check the attached
patch file.

At the moment, nodes in the hostfile have to be listed in
ascending order of slot size when we use "map-by node" or
"map-by obj:span".

The problem is that the hostfile created by Torque in our
cluster always lists allocated nodes in descending order...
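
Until the fix goes in, reordering the hostfile so that the smaller
nodes come first should work around it. For example, the pbs_hosts
from my earlier mail below would become:

node05 slots=2
node06 slots=2
node04 slots=8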

Regards,
Tetsuya Mishima

(See attached file: patch.rr)

> Hi Ralph,
>
> I did an overall verification of the rr_mapper and found another
> problem with "map-by node". As far as I checked, every "map-by obj"
> other than node worked fine. I myself do not use "map-by node", but
> I'd like to report it to improve the reliability of 1.7.5. It seems
> too difficult for me to resolve, so I hope you can take a look.
>
> The problem occurs when I mix two kinds of nodes, even though I add
> "-hetero-nodes" to the command line:
>
> [mishima@manage work]$ cat pbs_hosts
> node04 slots=8
> node05 slots=2
> node06 slots=2
>
> [mishima@manage work]$ mpirun -np 12 -machinefile pbs_hosts -map-by node
> -report-bindings -hetero-nodes /home/mishima/mis/openmpi/demos/myprog
> [manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file
> rmaps_rr.c at line 241
> [manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file
> base/rmaps_base_map_job.c at line 285
>
> With "-np 11", it works. But rank 10 is bound to the wrong core (which is
> already used by rank 0). I guess something is wrong with the handling of
> different topology when "map-by node" is specified. In addition, the
> calculation of assigning procs to each node has some problems:
>
> [mishima@manage work]$ mpirun -np 11 -machinefile pbs_hosts -map-by node
> -report-bindings -hetero-nodes /home/mishima/mis/openmpi/demos/myprog
> [node04.cluster:13384] MCW rank 3 bound to socket 0[core 1[hwt 0]]:
> [./B/./././././.][./././././././.][./././././././.][./././././././.]
> [node04.cluster:13384] MCW rank 6 bound to socket 0[core 2[hwt 0]]:
> [././B/././././.][./././././././.][./././././././.][./././././././.]
> [node04.cluster:13384] MCW rank 8 bound to socket 0[core 3[hwt 0]]:
> [./././B/./././.][./././././././.][./././././././.][./././././././.]
> [node04.cluster:13384] MCW rank 10 bound to socket 0[core 0[hwt 0]]:
> [B/././././././.][./././././././.][./././././././.][./././././././.]
> [node04.cluster:13384] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> [B/././././././.][./././././././.][./././././././.][./././././././.]
> [node06.cluster:24192] MCW rank 5 bound to socket 0[core 1[hwt 0]]:
> [./B/./.][./././.]
> [node06.cluster:24192] MCW rank 2 bound to socket 0[core 0[hwt 0]]:
> [B/././.][./././.]
> [node05.cluster:25655] MCW rank 9 bound to socket 0[core 3[hwt 0]]:
> [./././B][./././.]
> [node05.cluster:25655] MCW rank 1 bound to socket 0[core 0[hwt 0]]:
> [B/././.][./././.]
> [node05.cluster:25655] MCW rank 4 bound to socket 0[core 1[hwt 0]]:
> [./B/./.][./././.]
> [node05.cluster:25655] MCW rank 7 bound to socket 0[core 2[hwt 0]]:
> [././B/.][./././.]
> Hello world from process 4 of 11
> Hello world from process 7 of 11
> Hello world from process 6 of 11
> Hello world from process 3 of 11
> Hello world from process 0 of 11
> Hello world from process 8 of 11
> Hello world from process 2 of 11
> Hello world from process 5 of 11
> Hello world from process 9 of 11
> Hello world from process 1 of 11
> Hello world from process 10 of 11
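>
> To illustrate what I would expect, here is a minimal sketch of a
> slot-aware "map-by node" round robin (plain C, not Open MPI code;
> the slot counts come from the pbs_hosts above). With slots 8/2/2 and
> -np 11 it assigns 7/2/2 procs per node, whereas the mapping above
> puts 4 procs on node05, which only has 2 slots, and only 5 on node04:
>
> #include <stdio.h>
>
> int main(void)
> {
>     int slots[]  = {8, 2, 2};  /* node04, node05, node06 (pbs_hosts) */
>     int nprocs[] = {0, 0, 0};
>     int nnodes = 3, np = 11, assigned = 0;
>
>     while (assigned < np) {
>         int progressed = 0;
>         for (int i = 0; i < nnodes && assigned < np; i++) {
>             if (nprocs[i] < slots[i]) { /* skip nodes with full slots */
>                 nprocs[i]++;
>                 assigned++;
>                 progressed = 1;
>             }
>         }
>         if (!progressed) /* every slot taken; stop rather than spin */
>             break;
>     }
>
>     for (int i = 0; i < nnodes; i++)
>         printf("node0%d: %d procs\n", i + 4, nprocs[i]);
>     return 0;
> }
>
> In words: each round gives one proc to every node that still has a
> free slot, so node05 and node06 drop out after two rounds and the
> remaining five procs all land on node04.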
>
> Regards,
> Tetsuya Mishima
>
