Hi Ralph, I've found the fix. Please check the attached patch file.
Currently, the nodes in the hostfile have to be listed in ascending order of slot count when we use "map-by node" or "map-by obj:span". The problem is that the hostfile created by Torque in our cluster always lists the allocated nodes in descending order...

Regards,
Tetsuya Mishima

(See attached file: patch.rr)

> Hi Ralph,
>
> I did overall verification of rr_mapper, and I found another problem
> with "map-by node". As far as I checked, "map-by obj" other than node
> worked fine. I myself do not use "map-by node", but I'd like to report
> it to improve reliability of 1.7.5. It seems too difficult for me to
> resolve it. I hope you could take a look.
>
> The problem occurs when I mixedly use two kinds of node, although I
> add "-hetero-nodes" to command line:
>
> [mishima@manage work]$ cat pbs_hosts
> node04 slots=8
> node05 slots=2
> node06 slots=2
>
> [mishima@manage work]$ mpirun -np 12 -machinefile pbs_hosts -map-by node
> -report-bindings -hetero-nodes /home/mishima/mi
> s/openmpi/demos/myprog
> [manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file
> rmaps_rr.c at line 241
> [manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file
> base/rmaps_base_map_job.c at line 285
>
> With "-np 11", it works. But rank 10 is bound to the wrong core (which is
> already used by rank 0). I guess something is wrong with the handling of
> different topology when "map-by node" is specified. In addition, the
> calculation of assigning procs to each node has some problems:
>
> [mishima@manage work]$ mpirun -np 11 -machinefile pbs_hosts -map-by node
> -report-bindings -hetero-nodes /home/mishima/mi
> s/openmpi/demos/myprog
> [node04.cluster:13384] MCW rank 3 bound to socket 0[core 1[hwt 0]]:
> [./B/./././././.][./././././././.][./././././././.][
> ./././././././.]
> [node04.cluster:13384] MCW rank 6 bound to socket 0[core 2[hwt 0]]:
> [././B/././././.][./././././././.][./././././././.][
> ./././././././.]
> [node04.cluster:13384] MCW rank 8 bound to socket 0[core 3[hwt 0]]:
> [./././B/./././.][./././././././.][./././././././.][
> ./././././././.]
> [node04.cluster:13384] MCW rank 10 bound to socket 0[core 0[hwt 0]]:
> [B/././././././.][./././././././.][./././././././.]
> [./././././././.]
> [node04.cluster:13384] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> [B/././././././.][./././././././.][./././././././.][
> ./././././././.]
> [node06.cluster:24192] MCW rank 5 bound to socket 0[core 1[hwt 0]]:
> [./B/./.][./././.]
> [node06.cluster:24192] MCW rank 2 bound to socket 0[core 0[hwt 0]]:
> [B/././.][./././.]
> [node05.cluster:25655] MCW rank 9 bound to socket 0[core 3[hwt 0]]:
> [./././B][./././.]
> [node05.cluster:25655] MCW rank 1 bound to socket 0[core 0[hwt 0]]:
> [B/././.][./././.]
> [node05.cluster:25655] MCW rank 4 bound to socket 0[core 1[hwt 0]]:
> [./B/./.][./././.]
> [node05.cluster:25655] MCW rank 7 bound to socket 0[core 2[hwt 0]]:
> [././B/.][./././.]
> Hello world from process 4 of 11
> Hello world from process 7 of 11
> Hello world from process 6 of 11
> Hello world from process 3 of 11
> Hello world from process 0 of 11
> Hello world from process 8 of 11
> Hello world from process 2 of 11
> Hello world from process 5 of 11
> Hello world from process 9 of 11
> Hello world from process 1 of 11
> Hello world from process 10 of 11
>
> Regards,
> Tetsuya Mishima
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
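
For anyone who hits the hostfile ordering issue described at the top of this message before a fix is in place, one possible stopgap is to reorder the hostfile by ascending slot count before passing it to mpirun. The following is only a minimal sketch of that idea, not the contents of patch.rr. The function name sort_hostfile_by_slots is hypothetical, the input is assumed to use the "name slots=N" format shown in pbs_hosts above, and a host with no "slots=" entry is assumed to count as one slot.

```python
import re


def sort_hostfile_by_slots(text):
    """Return the hostfile text with entries in ascending order of slots.

    Hypothetical helper: assumes lines look like "node04 slots=8".
    Blank lines and lines starting with "#" are dropped.
    """
    entries = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        match = re.search(r"\bslots=(\d+)", stripped)
        # Assumption: a host without an explicit "slots=" counts as 1 slot.
        slots = int(match.group(1)) if match else 1
        entries.append((slots, stripped))
    # list.sort is stable, so hosts with equal slot counts keep their order.
    entries.sort(key=lambda entry: entry[0])
    return "".join(line + "\n" for _, line in entries)


if __name__ == "__main__":
    pbs_hosts = "node04 slots=8\nnode05 slots=2\nnode06 slots=2\n"
    print(sort_hostfile_by_slots(pbs_hosts), end="")
```

With the pbs_hosts contents above, this lists node05 and node06 first and node04 last. Whether that ordering is enough to avoid the problem on a given cluster is something to check with -report-bindings.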