I'm having trouble seeing why it is failing, so I added some more debug output. 
Could you run the failure case again with -mca rmaps_base_verbose 10?
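
For example, just adding that MCA parameter to the same command line you used before (paths and hosts taken from your earlier mails, so adjust as needed):

mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2 \
       -display-map -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog

and please send me the full output.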

Thanks
Ralph

On Feb 27, 2014, at 6:11 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> I was just checking the difference; nothing particularly significant...
> 
> Anyway, I guess it's due to the behavior when the slot count is missing
> (it's treated as slots=1), so the nodes get oversubscribed unintentionally.
> 
> I'm going out now, so I can't verify it quickly. If I provide the
> correct slot counts, it will work, I guess. What do you think?
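> 
> For example (just a sketch of what I mean, in the same form as the
> pbs_hosts file quoted below), a hostfile giving the real slot counts
> would look like:
> 
> node05 slots=8
> node06 slots=8
> 
> and passing it with -hostfile should avoid the implicit slots=1.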
> 
> Tetsuya
> 
>> "restore" in what sense?
>> 
>> On Feb 27, 2014, at 4:10 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> 
>>> Hi Ralph, this is just for your information.
>>> 
>>> I tried to restore the previous orte_rmaps_rr_byobj. Then I get the result
>>> below with this command line:
>>> 
>>> mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2
>>> -display-map  -bind-to core:overload-allowed ~/mis/openmpi/demos/myprog
>>> Data for JOB [31184,1] offset 0
>>> 
>>> ========================   JOB MAP   ========================
>>> 
>>> Data for node: node05  Num slots: 1    Max slots: 0    Num procs: 7
>>>       Process OMPI jobid: [31184,1] App: 0 Process rank: 0
>>>       Process OMPI jobid: [31184,1] App: 0 Process rank: 2
>>>       Process OMPI jobid: [31184,1] App: 0 Process rank: 4
>>>       Process OMPI jobid: [31184,1] App: 0 Process rank: 6
>>>       Process OMPI jobid: [31184,1] App: 0 Process rank: 1
>>>       Process OMPI jobid: [31184,1] App: 0 Process rank: 3
>>>       Process OMPI jobid: [31184,1] App: 0 Process rank: 5
>>> 
>>> Data for node: node06  Num slots: 1    Max slots: 0    Num procs: 1
>>>       Process OMPI jobid: [31184,1] App: 0 Process rank: 7
>>> 
>>> =============================================================
>>> [node06.cluster:18857] MCW rank 7 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node05.cluster:21399] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>> [node05.cluster:21399] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node05.cluster:21399] MCW rank 5 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>> [node05.cluster:21399] MCW rank 6 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>> [node05.cluster:21399] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node05.cluster:21399] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>> [node05.cluster:21399] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>> ....
>>> 
>>> 
>>> Then I add "-hostfile pbs_hosts" and the result is:
>>> 
>>> [mishima@manage work]$ cat pbs_hosts
>>> node05 slots=8
>>> node06 slots=8
>>> [mishima@manage work]$ mpirun -np 8 -hostfile ~/work/pbs_hosts
>>> -report-bindings -map-by socket:pe=2 -display-map
>>> ~/mis/openmpi/demos/myprog
>>> Data for JOB [30254,1] offset 0
>>> 
>>> ========================   JOB MAP   ========================
>>> 
>>> Data for node: node05  Num slots: 8    Max slots: 0    Num procs: 4
>>>       Process OMPI jobid: [30254,1] App: 0 Process rank: 0
>>>       Process OMPI jobid: [30254,1] App: 0 Process rank: 2
>>>       Process OMPI jobid: [30254,1] App: 0 Process rank: 1
>>>       Process OMPI jobid: [30254,1] App: 0 Process rank: 3
>>> 
>>> Data for node: node06  Num slots: 8    Max slots: 0    Num procs: 4
>>>       Process OMPI jobid: [30254,1] App: 0 Process rank: 4
>>>       Process OMPI jobid: [30254,1] App: 0 Process rank: 6
>>>       Process OMPI jobid: [30254,1] App: 0 Process rank: 5
>>>       Process OMPI jobid: [30254,1] App: 0 Process rank: 7
>>> 
>>> =============================================================
>>> [node05.cluster:21501] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>> [node05.cluster:21501] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>> [node05.cluster:21501] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node05.cluster:21501] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>> [node06.cluster:18935] MCW rank 6 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>> [node06.cluster:18935] MCW rank 7 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>> [node06.cluster:18935] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>> [node06.cluster:18935] MCW rank 5 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>> ....
>>> 
>>> 
>>> I think the previous version's behavior is closer to what I expect.
>>> 
>>> Tetsuya
>>> 
>>>> They have 4 cores/socket and 2 sockets, so 4 x 2 = 8 cores each.
>>>> 
>>>> Here is the output of lstopo.
>>>> 
>>>> [mishima@manage round_robin]$ rsh node05
>>>> Last login: Tue Feb 18 15:10:15 from manage
>>>> [mishima@node05 ~]$ lstopo
>>>> Machine (32GB)
>>>> NUMANode L#0 (P#0 16GB) + Socket L#0 + L3 L#0 (6144KB)
>>>> L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
>>>> L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
>>>> L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
>>>> L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
>>>> NUMANode L#1 (P#1 16GB) + Socket L#1 + L3 L#1 (6144KB)
>>>> L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
>>>> L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
>>>> L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
>>>> L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
>>>> ....
>>>> 
>>>> I focused on byobj_span and bynode. I didn't notice byobj was modified,
>>>> sorry.
>>>> 
>>>> Tetsuya
>>>> 
>>>>> Hmmm... what does your node look like again (sockets and cores)?
>>>>> 
>>>>> On Feb 27, 2014, at 3:19 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>> 
>>>>>> 
>>>>>> Hi Ralph, I'm afraid to say your new "map-by obj" causes another problem.
>>>>>> 
>>>>>> I get an overload message with this command line, as shown below:
>>>>>> 
>>>>>> mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2 -display-map ~/mis/openmpi/demos/myprog
>>>>>> 
>>>>>> --------------------------------------------------------------------------
>>>>>> A request was made to bind to that would result in binding more
>>>>>> processes than cpus on a resource:
>>>>>> 
>>>>>> Bind to:         CORE
>>>>>> Node:            node05
>>>>>> #processes:  2
>>>>>> #cpus:          1
>>>>>> 
>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>> option to your binding directive.
>>>>>> 
>>>>>> --------------------------------------------------------------------------
>>>>>> 
>>>>>> Then I add "-bind-to core:overload-allowed" to see what happens.
>>>>>> 
>>>>>> mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2 -display-map -bind-to core:overload-allowed ~/mis/openmpi/demos/myprog
>>>>>> Data for JOB [14398,1] offset 0
>>>>>> 
>>>>>> ========================   JOB MAP   ========================
>>>>>> 
>>>>>> Data for node: node05  Num slots: 1    Max slots: 0    Num procs: 4
>>>>>>      Process OMPI jobid: [14398,1] App: 0 Process rank: 0
>>>>>>      Process OMPI jobid: [14398,1] App: 0 Process rank: 1
>>>>>>      Process OMPI jobid: [14398,1] App: 0 Process rank: 2
>>>>>>      Process OMPI jobid: [14398,1] App: 0 Process rank: 3
>>>>>> 
>>>>>> Data for node: node06  Num slots: 1    Max slots: 0    Num procs: 4
>>>>>>      Process OMPI jobid: [14398,1] App: 0 Process rank: 4
>>>>>>      Process OMPI jobid: [14398,1] App: 0 Process rank: 5
>>>>>>      Process OMPI jobid: [14398,1] App: 0 Process rank: 6
>>>>>>      Process OMPI jobid: [14398,1] App: 0 Process rank: 7
>>>>>> 
>>>>>> =============================================================
>>>>>> [node06.cluster:18443] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node05.cluster:20901] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node06.cluster:18443] MCW rank 7 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> [node05.cluster:20901] MCW rank 3 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> [node06.cluster:18443] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node05.cluster:20901] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node06.cluster:18443] MCW rank 5 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> [node05.cluster:20901] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> Hello world from process 4 of 8
>>>>>> Hello world from process 2 of 8
>>>>>> Hello world from process 6 of 8
>>>>>> Hello world from process 0 of 8
>>>>>> Hello world from process 5 of 8
>>>>>> Hello world from process 1 of 8
>>>>>> Hello world from process 7 of 8
>>>>>> Hello world from process 3 of 8
>>>>>> 
>>>>>> When I add "map-by obj:span", it works fine:
>>>>>> 
>>>>>> mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2,span -display-map ~/mis/openmpi/demos/myprog
>>>>>> Data for JOB [14703,1] offset 0
>>>>>> 
>>>>>> ========================   JOB MAP   ========================
>>>>>> 
>>>>>> Data for node: node05  Num slots: 1    Max slots: 0    Num procs: 4
>>>>>>      Process OMPI jobid: [14703,1] App: 0 Process rank: 0
>>>>>>      Process OMPI jobid: [14703,1] App: 0 Process rank: 2
>>>>>>      Process OMPI jobid: [14703,1] App: 0 Process rank: 1
>>>>>>      Process OMPI jobid: [14703,1] App: 0 Process rank: 3
>>>>>> 
>>>>>> Data for node: node06  Num slots: 1    Max slots: 0    Num procs: 4
>>>>>>      Process OMPI jobid: [14703,1] App: 0 Process rank: 4
>>>>>>      Process OMPI jobid: [14703,1] App: 0 Process rank: 6
>>>>>>      Process OMPI jobid: [14703,1] App: 0 Process rank: 5
>>>>>>      Process OMPI jobid: [14703,1] App: 0 Process rank: 7
>>>>>> 
>>>>>> =============================================================
>>>>>> [node06.cluster:18491] MCW rank 6 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> [node05.cluster:20949] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B][./././.]
>>>>>> [node06.cluster:18491] MCW rank 7 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>>>>> [node05.cluster:20949] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][././B/B]
>>>>>> [node06.cluster:18491] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node05.cluster:20949] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./.][./././.]
>>>>>> [node06.cluster:18491] MCW rank 5 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>>>>> [node05.cluster:20949] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]]: [./././.][B/B/./.]
>>>>>> ....
>>>>>> 
>>>>>> So, byobj_span would be okay. Of course, bynode and byslot should be okay.
>>>>>> Could you take a look at orte_rmaps_rr_byobj again?
>>>>>> 
>>>>>> Regards,
>>>>>> Tetsuya Mishima
>>>>>> 
