Siegmar --

We've been talking about this offline.  Can you send us an lstopo output from 
your Solaris machine?  Send us the text output and the xml output, e.g.:

lstopo > solaris.txt
lstopo solaris.xml
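
lstopo picks the output format from the file extension, so the second
command should emit XML.  If it doesn't for some reason, forcing the
format explicitly should also work:

lstopo --of xml solaris.xml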

Thanks!


On Feb 5, 2013, at 12:30 AM, Siegmar Gross 
<siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi
> 
> Now I can use all our machines once more. I have a problem on
> Solaris 10 x86_64, because the mapping of processes doesn't
> correspond to the rankfile. I removed the output from "hostname"
> and wrapped long lines.
> 
> tyr rankfiles 114 cat rf_ex_sunpc
> # mpiexec -report-bindings -rf rf_ex_sunpc hostname
> 
> rank 0=sunpc0 slot=0:0-1,1:0-1
> rank 1=sunpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=sunpc1 slot=1:1
> 
> 
> tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
> [sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1]
>  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> [sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]:
>  [B B][. .] (slot list 0:0-1)
> [sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1]
>  socket 1[core 0-1]: [B B][B B] (slot list 1:0)
> [sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1]
>  socket 1[core 0-1]: [B B][B B] (slot list 1:1)
> 
> 
> Can I provide any information to help solve this problem? My
> rankfile works as expected if I use only Linux machines.
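> 
> For reference, with the rankfile above I would expect rank 1 to get both
> cores of socket 0, rank 2 core 0 of socket 1, and rank 3 core 1 of
> socket 1 on sunpc1, i.e. the -report-bindings lines for ranks 2 and 3
> should look roughly like in the odin example further down:
> 
>  socket 1[core 0]: [. .][B .] (slot list 1:0)
>  socket 1[core 1]: [. .][. B] (slot list 1:1)
> 
> On Solaris they instead get all four cores ([B B][B B]).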
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
>>> Hmmm....well, it certainly works for me:
>>> 
>>> [rhc@odin ~/v1.6]$ cat rf
>>> rank 0=odin093 slot=0:0-1,1:0-1
>>> rank 1=odin094 slot=0:0-1
>>> rank 2=odin094 slot=1:0
>>> rank 3=odin094 slot=1:1
>>> 
>>> 
>>> [rhc@odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
>>> -mca opal_paffinity_alone 0 hostname
>>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
>>>  socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>> odin093.cs.indiana.edu
>>> odin094.cs.indiana.edu
>>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
>>>  socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>>> odin094.cs.indiana.edu
>>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
>>>  socket 1[core 0]: [. .][B .] (slot list 1:0)
>>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
>>>  socket 1[core 1]: [. .][. B] (slot list 1:1)
>>> odin094.cs.indiana.edu
>> 
>> Interesting that it works on your machines.
>> 
>> 
>>> I see one thing of concern to me in your output - your second node
>>> appears to be a Sun computer. Is it the same physical architecture?
>>> Is it also running Linux? Are you sure it is using the same version
>>> of OMPI, built for that environment and hardware?
>> 
>> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
>> linpc1) use the same hardware. "linpc" runs openSUSE 12.1 and "sunpc"
>> runs Solaris 10 x86_64. All machines use the same version of Open MPI,
>> built for that environment. At the moment I can only use sunpc1 and
>> linpc1 ("my" developer machines). Next week I will have access to all
>> machines, so I can test whether I get a different behaviour when I use
>> two machines with the same operating system (mixed operating systems
>> weren't a problem in the past; only machines with different endianness
>> were). I'll let you know my results.
>> 
>> 
>> Kind regards
>> 
>> Siegmar
>> 
>> 
>> 
>> 
>>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross
>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>> 
>>>> Hi
>>>> 
>>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
>>>> it works for my previous rankfile.
>>>> 
>>>> 
>>>>> #3493: Handle the case where rankfile provides the allocation
>>>>> ------------------------------------+----------------------------
>>>>>  Reporter:  rhc                     |      Owner:  jsquyres
>>>>>      Type:  changeset move request  |     Status:  new
>>>>>  Priority:  critical                |  Milestone:  Open MPI 1.6.4
>>>>>   Version:  trunk                   |   Keywords:
>>>>> ------------------------------------+----------------------------
>>>>> Please apply the attached patch that corrects the rmaps function for
>>>>> obtaining the available nodes when rankfile is providing the allocation.
>>>> 
>>>> 
>>>> tyr rankfiles 129 more rf_linpc1
>>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
>>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>>> 
>>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
>>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
>>>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>> 
>>>> 
>>>> 
>>>> Unfortunately I don't get the expected result for the following
>>>> rankfile.
>>>> 
>>>> tyr rankfiles 114 more rf_bsp 
>>>> # mpiexec -report-bindings -rf rf_bsp hostname
>>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>>> rank 1=sunpc1 slot=0:0-1
>>>> rank 2=sunpc1 slot=1:0
>>>> rank 3=sunpc1 slot=1:1
>>>> 
>>>> I would expect that rank 0 gets all four cores of linpc1, rank 1
>>>> both cores of socket 0 on sunpc1, rank 2 core 0 of socket 1 on
>>>> sunpc1, and rank 3 core 1 of socket 1 on sunpc1. Everything is fine
>>>> for ranks 0 and 1, but it's wrong for ranks 2 and 3, because they
>>>> both get all four cores of sunpc1. Is something wrong with my
>>>> rankfile or with your mapping of processes to cores? I have removed
>>>> the output from "hostname" and wrapped long lines.
>>>> 
>>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
>>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
>>>> [B B][. .] (slot list 0:0-1)
>>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 1:0)
>>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 1:1)
>>>> 
>>>> 
>>>> I get the following output if I add the options you mentioned in a
>>>> previous email.
>>>> 
>>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
>>>> -display-allocation -mca ras_base_verbose 5 hostname
>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
>>>> Querying component [cm]
>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
>>>> Skipping component [cm]. Query failed to return a module
>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
>>>> No component selected!
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>>> nothing found in module - proceeding to hostfile
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>>> parsing default hostfile
>>>>  /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>>> nothing found in hostfiles or dash-host - checking for rankfile
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>>> ras:base:node_insert inserting 2 nodes
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>>> ras:base:node_insert node linpc1
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>>> ras:base:node_insert node sunpc1
>>>> 
>>>> ======================   ALLOCATED NODES   ======================
>>>> 
>>>> Data for node: tyr.informatik.hs-fulda.de  Num slots: 0  Max slots: 0
>>>> Data for node: linpc1  Num slots: 1    Max slots: 0
>>>> Data for node: sunpc1  Num slots: 3    Max slots: 0
>>>> 
>>>> =================================================================
>>>> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>>> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
>>>> [B B][. .] (slot list 0:0-1)
>>>> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 1:0)
>>>> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>>> [B B][B B] (slot list 1:1)
>>>> 
>>>> 
>>>> Thank you very much in advance for any suggestions and help.
>>>> 
>>>> 
>>>> Kind regards
>>>> 
>>>> Siegmar
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

