On Jan 31, 2013, at 12:39 PM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi
> 
>> Hmmm....well, it certainly works for me:
>> 
>> [rhc@odin ~/v1.6]$ cat rf
>> rank 0=odin093 slot=0:0-1,1:0-1
>> rank 1=odin094 slot=0:0-1
>> rank 2=odin094 slot=1:0
>> rank 3=odin094 slot=1:1
>> 
>> 
>> [rhc@odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
>> -mca opal_paffinity_alone 0 hostname
>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
>>  socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>> odin093.cs.indiana.edu
>> odin094.cs.indiana.edu
>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
>>  socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>> odin094.cs.indiana.edu
>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
>>  socket 1[core 0]: [. .][B .] (slot list 1:0)
>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
>>  socket 1[core 1]: [. .][. B] (slot list 1:1)
>> odin094.cs.indiana.edu
> 
> Interesting that it works on your machines.
> 
> 
>> I see one thing of concern to me in your output - your second node
>> appears to be a Sun computer. Is it the same physical architecture?
>> Is it also running Linux? Are you sure it is using the same version
>> of OMPI, built for that environment and hardware?
> 
> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
> linpc1) use the same hardware. "linpc" runs openSUSE 12.1 and "sunpc"
> runs Solaris 10 x86_64. All machines use the same version of Open MPI,
> built for their respective environments. At the moment I can only use
> sunpc1 and linpc1 ("my" developer machines). Next week I will have
> access to all machines, so I can test whether I get different
> behaviour when I use two machines with the same operating system
> (mixed operating systems weren't a problem in the past; only machines
> with different endianness were). I will let you know my results.

I suspect the problem is Solaris being on the remote machine. I don't know how 
far our Solaris support may have rotted by now.
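
A quick sanity check you could try (a sketch only; the exact component
names and verbosity levels depend on how 1.6.4 was built on sunpc1): run

  ompi_info | grep paffinity

on the Solaris machine to see which processor-affinity components that
build actually contains, and re-run the job with the mapper and launch
daemon made more verbose, e.g.

  mpiexec -mca rmaps_base_verbose 5 -mca odls_base_verbose 5 \
          -report-bindings -rf rf_bsp hostname

so the daemon on sunpc1 shows which slot list it thinks it is applying
for ranks 2 and 3.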

> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
> 
>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>> 
>>> Hi
>>> 
>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
>>> it works for my previous rankfile.
>>> 
>>> 
>>>> #3493: Handle the case where rankfile provides the allocation
>>>> ----------------------------------+-----------------------------
>>>>  Reporter:  rhc                   |      Owner:  jsquyres
>>>>      Type:  changeset move request|     Status:  new
>>>>  Priority:  critical              |  Milestone:  Open MPI 1.6.4
>>>>   Version:  trunk                 |   Keywords:
>>>> ----------------------------------+-----------------------------
>>>> Please apply the attached patch that corrects the rmaps function for
>>>> obtaining the available nodes when rankfile is providing the allocation.
>>> 
>>> 
>>> tyr rankfiles 129 more rf_linpc1
>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>> 
>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
>>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>> 
>>> 
>>> 
>>> Unfortunately I don't get the expected result for the following
>>> rankfile.
>>> 
>>> tyr rankfiles 114 more rf_bsp 
>>> # mpiexec -report-bindings -rf rf_bsp hostname
>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>> rank 1=sunpc1 slot=0:0-1
>>> rank 2=sunpc1 slot=1:0
>>> rank 3=sunpc1 slot=1:1
>>> 
>>> I would expect rank 0 to get all four cores of linpc1, rank 1 both
>>> cores of socket 0 on sunpc1, rank 2 core 0 of socket 1 on sunpc1,
>>> and rank 3 core 1 of socket 1 on sunpc1. Everything is fine for
>>> ranks 0 and 1, but the binding is wrong for ranks 2 and 3, because
>>> they both get all four cores of sunpc1. Is something wrong with my
>>> rankfile or with your mapping of processes to cores? I have removed
>>> the output from "hostname" and wrapped long lines.
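>>> 
>>> Expressed in the mask notation that -report-bindings uses, the
>>> bindings I expect would look like this (illustration only, not
>>> actual output):
>>> 
>>>   rank 0 (linpc1): [B B][B B]   (slot list 0:0-1,1:0-1)
>>>   rank 1 (sunpc1): [B B][. .]   (slot list 0:0-1)
>>>   rank 2 (sunpc1): [. .][B .]   (slot list 1:0)
>>>   rank 3 (sunpc1): [. .][. B]   (slot list 1:1)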
>>> 
>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
>>> [B B][. .] (slot list 0:0-1)
>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:0)
>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:1)
>>> 
>>> 
>>> I get the following output if I add the options that you mentioned
>>> in a previous email.
>>> 
>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
>>> -display-allocation -mca ras_base_verbose 5 hostname
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
>>> Querying component [cm]
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
>>> Skipping component [cm]. Query failed to return a module
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
>>> No component selected!
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> nothing found in module - proceeding to hostfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> parsing default hostfile
>>>  /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> nothing found in hostfiles or dash-host - checking for rankfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert inserting 2 nodes
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert node linpc1
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert node sunpc1
>>> 
>>> ======================   ALLOCATED NODES   ======================
>>> 
>>> Data for node: tyr.informatik.hs-fulda.de  Num slots: 0  Max slots: 0
>>> Data for node: linpc1  Num slots: 1    Max slots: 0
>>> Data for node: sunpc1  Num slots: 3    Max slots: 0
>>> 
>>> =================================================================
>>> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
>>> [B B][. .] (slot list 0:0-1)
>>> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:0)
>>> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:1)
>>> 
>>> 
>>> Thank you very much in advance for any suggestions and help.
>>> 
>>> 
>>> Kind regards
>>> 
>>> Siegmar
>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 

