I'm afraid that patch didn't solve the problem when I tested it - it fixed the 
cpus-per-rank > 1 case, but not the case where the slots are listed in 
descending order. It took a little more work, but I believe the patch in 
r30798 (based on yours) completes the job.
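
For anyone following along, here is a minimal standalone sketch (plain C, and 
not the actual ORTE rmaps code - just an illustration of the intended 
behavior) of round-robin-by-node placement: a node that has used all of its 
slots is simply skipped on later passes, so the result should not depend on 
whether the hostfile lists the nodes in ascending or descending slot order. 
The node names and slot counts are taken from Tetsuya's hostfile below.

/* Standalone illustration only - NOT the ORTE mapper code.
 * Round-robin placement by node over nodes with unequal slot counts. */
#include <stdio.h>

int main(void)
{
    /* allocation mirroring the reported hostfile:
     * node04 slots=8, node05 slots=2, node06 slots=2 (descending order) */
    const char *name[] = { "node04", "node05", "node06" };
    int slots[] = { 8, 2, 2 };
    int used[]  = { 0, 0, 0 };
    int nnodes = 3, np = 12, rank = 0;

    while (rank < np) {
        int placed = 0;
        for (int n = 0; n < nnodes && rank < np; n++) {
            if (used[n] >= slots[n])
                continue;            /* node is full: skip it, don't error out */
            printf("rank %2d -> %s (slot %d)\n", rank, name[n], used[n]);
            used[n]++;
            rank++;
            placed = 1;
        }
        if (!placed) {               /* every node full but ranks remain */
            fprintf(stderr, "allocation exhausted at rank %d\n", rank);
            return 1;
        }
    }
    return 0;
}

With np=12 over 8/2/2 slots this places 8 ranks on node04 and 2 each on 
node05 and node06 regardless of hostfile order - which is what the -np 12 
run below should have produced instead of the ORTE_ERROR_LOG.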

FWIW: the "hetero-nodes" flag is a bit of a red herring here. That flag 
indicates a difference in physical topology between the nodes, not a difference 
in the number of assigned slots. It would be required if your nodes actually 
had different numbers of cores (e.g., different chips) and you wanted us to 
bind the processes or map by some object lower than the node level - but only 
in such cases.

Appreciate your help! I've scheduled the patch for 1.7.5 and assigned it to you 
for verification - please let me know if you don't have time to do so.

https://svn.open-mpi.org/trac/ompi/ticket/4296

Ralph

On Feb 19, 2014, at 5:00 AM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph, I've found the fix. Please check the attached
> patch file.
> 
> At the moment, nodes in the hostfile must be listed in
> ascending order of slot size when we use "map-by node" or
> "map-by obj:span".
> 
> The problem is that the hostfile created by Torque in our
> cluster always lists allocated nodes in descending order...
> 
> Regards,
> Tetsuya Mishima
> 
> (See attached file: patch.rr)
> 
>> Hi Ralph,
>> 
>> I did an overall verification of the rr_mapper, and I found another
>> problem with "map-by node". As far as I checked, "map-by obj" for
>> objects other than node worked fine. I don't use "map-by node" myself,
>> but I'd like to report it to improve the reliability of 1.7.5. It seems
>> too difficult for me to resolve, so I hope you can take a look.
>> 
>> The problem occurs when I use a mix of two kinds of nodes, even though
>> I add "-hetero-nodes" to the command line:
>> 
>> [mishima@manage work]$ cat pbs_hosts
>> node04 slots=8
>> node05 slots=2
>> node06 slots=2
>> 
>> [mishima@manage work]$ mpirun -np 12 -machinefile pbs_hosts -map-by node -report-bindings -hetero-nodes /home/mishima/mis/openmpi/demos/myprog
>> [manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file rmaps_rr.c at line 241
>> [manage.cluster:13113] [[15682,0],0] ORTE_ERROR_LOG: Fatal in file base/rmaps_base_map_job.c at line 285
>> 
>> With "-np 11" it works, but rank 10 is bound to the wrong core (one that
>> is already used by rank 0). I guess something is wrong with the handling
>> of the different topologies when "map-by node" is specified. In addition,
>> the calculation that assigns procs to each node has some problems:
>> 
>> [mishima@manage work]$ mpirun -np 11 -machinefile pbs_hosts -map-by node -report-bindings -hetero-nodes /home/mishima/mis/openmpi/demos/myprog
>> [node04.cluster:13384] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.][./././././././.][./././././././.]
>> [node04.cluster:13384] MCW rank 6 bound to socket 0[core 2[hwt 0]]: [././B/././././.][./././././././.][./././././././.][./././././././.]
>> [node04.cluster:13384] MCW rank 8 bound to socket 0[core 3[hwt 0]]: [./././B/./././.][./././././././.][./././././././.][./././././././.]
>> [node04.cluster:13384] MCW rank 10 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.][./././././././.][./././././././.]
>> [node04.cluster:13384] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.][./././././././.][./././././././.]
>> [node06.cluster:24192] MCW rank 5 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
>> [node06.cluster:24192] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
>> [node05.cluster:25655] MCW rank 9 bound to socket 0[core 3[hwt 0]]: [./././B][./././.]
>> [node05.cluster:25655] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
>> [node05.cluster:25655] MCW rank 4 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
>> [node05.cluster:25655] MCW rank 7 bound to socket 0[core 2[hwt 0]]: [././B/.][./././.]
>> Hello world from process 4 of 11
>> Hello world from process 7 of 11
>> Hello world from process 6 of 11
>> Hello world from process 3 of 11
>> Hello world from process 0 of 11
>> Hello world from process 8 of 11
>> Hello world from process 2 of 11
>> Hello world from process 5 of 11
>> Hello world from process 9 of 11
>> Hello world from process 1 of 11
>> Hello world from process 10 of 11
>> 
>> Regards,
>> Tetsuya Mishima
>> 
> <patch.rr>
