Siegmar -- We've been talking about this offline. Can you send us an lstopo output from your Solaris machine? Send us both the text output and the XML output, e.g.:

    lstopo > solaris.txt
    lstopo solaris.xml
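(A note on the second command: lstopo is assumed here to pick the output format from the file extension, so "lstopo solaris.xml" writes XML directly. If the installed hwloc does not infer the format from the extension, it can usually be forced explicitly -- assuming a version with the --of option:

    lstopo --of xml solaris.xml
)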
Thanks!

On Feb 5, 2013, at 12:30 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi
>
> now I can use all our machines once more. I have a problem on
> Solaris 10 x86_64, because the mapping of processes doesn't
> correspond to the rankfile. I removed the output from "hostname"
> and wrapped long lines.
>
> tyr rankfiles 114 cat rf_ex_sunpc
> # mpiexec -report-bindings -rf rf_ex_sunpc hostname
>
> rank 0=sunpc0 slot=0:0-1,1:0-1
> rank 1=sunpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=sunpc1 slot=1:1
>
> tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
> [sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> [sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
> [sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:0)
> [sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:1)
>
> Can I provide any information to solve this problem? My
> rankfile works as expected if I use only Linux machines.
>
> Kind regards
>
> Siegmar
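(For readers following the slot lists: in this rankfile syntax an entry is socket:cores, so "slot=1:0" requests socket 1, core 0 only. A minimal annotated sketch, with hypothetical hosts not taken from the report:

    rank 0=hostA slot=0:0-1,1:0-1   # socket 0, cores 0-1, plus socket 1, cores 0-1
    rank 1=hostB slot=1:0           # socket 1, core 0 only

On that reading, ranks 2 and 3 above, with slot lists 1:0 and 1:1, should each be bound to a single core, yet the Solaris node reports [B B][B B], i.e. all four cores -- that is the mismatch under discussion.)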
>>> Hmmm....well, it certainly works for me:
>>>
>>> [rhc@odin ~/v1.6]$ cat rf
>>> rank 0=odin093 slot=0:0-1,1:0-1
>>> rank 1=odin094 slot=0:0-1
>>> rank 2=odin094 slot=1:0
>>> rank 3=odin094 slot=1:1
>>>
>>> [rhc@odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings -mca opal_paffinity_alone 0 hostname
>>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>> odin093.cs.indiana.edu
>>> odin094.cs.indiana.edu
>>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>>> odin094.cs.indiana.edu
>>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)
>>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to socket 1[core 1]: [. .][. B] (slot list 1:1)
>>> odin094.cs.indiana.edu
>>
>> Interesting that it works on your machines.
>>
>>> I see one thing of concern to me in your output - your second node
>>> appears to be a Sun computer. Is it the same physical architecture?
>>> Is it also running Linux? Are you sure it is using the same version
>>> of OMPI, built for that environment and hardware?
>>
>> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
>> linpc1) use the same hardware. "linpc" runs openSUSE 12.1 and "sunpc"
>> Solaris 10 x86_64. All machines use the same version of Open MPI,
>> built for that environment. At the moment I can only use sunpc1 and
>> linpc1 ("my" developer machines). Next week I will have access to all
>> machines, so I can test whether I get a different behaviour when I
>> use two machines with the same operating system (although mixed
>> operating systems weren't a problem in the past; only machines with
>> different endianness were). I will let you know my results.
>>
>> Kind regards
>>
>> Siegmar
>>
>>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross
>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>
>>>> Hi
>>>>
>>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
>>>> it works for my previous rankfile.
>>>>
>>>>> #3493: Handle the case where rankfile provides the allocation
>>>>> -----------------------------------+----------------------------
>>>>>  Reporter:  rhc                    |      Owner:  jsquyres
>>>>>      Type:  changeset move request |     Status:  new
>>>>>  Priority:  critical               |  Milestone:  Open MPI 1.6.4
>>>>>   Version:  trunk                  |   Keywords:
>>>>> -----------------------------------+----------------------------
>>>>> Please apply the attached patch that corrects the rmaps function
>>>>> for obtaining the available nodes when rankfile is providing the
>>>>> allocation.
>>>>
>>>> tyr rankfiles 129 more rf_linpc1
>>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
>>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>>>
>>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
>>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>>
>>>> Unfortunately I don't get the expected result for the following
>>>> rankfile.
>>>>
>>>> tyr rankfiles 114 more rf_bsp
>>>> # mpiexec -report-bindings -rf rf_bsp hostname
>>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>>> rank 1=sunpc1 slot=0:0-1
>>>> rank 2=sunpc1 slot=1:0
>>>> rank 3=sunpc1 slot=1:1
>>>>
>>>> I would expect that rank 0 gets all four cores of linpc1, rank 1
>>>> both cores of socket 0 of sunpc1, rank 2 core 0 of socket 1, and
>>>> rank 3 core 1 of socket 1 of sunpc1. Everything is fine for my
>>>> processes with ranks 0 and 1, but it's wrong for ranks 2 and 3,
>>>> because they both get all four cores of sunpc1. Is something wrong
>>>> with my rankfile or with your mapping of processes to cores? I have
>>>> removed the output from "hostname" and wrapped long lines.
>>>>
>>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
>>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:0)
>>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:1)
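(One way to cross-check what the operating system has actually bound each process to, independently of Open MPI's -report-bindings output, would be hwloc's hwloc-bind utility. A debugging sketch, assuming hwloc is installed on both nodes; this is not from the original report:

    mpiexec -rf rf_bsp hwloc-bind --get

Each launched hwloc-bind prints the CPU-set bitmask it inherited from the launcher, e.g. 0x00000003 for two cores. A rank whose slot list names a single core but which prints a four-core mask would confirm that the binding really is wrong at the OS level, not just in the report.)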
>>>> I get the following output if I add the options which you mentioned
>>>> in a previous email.
>>>>
>>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
>>>>   -display-allocation -mca ras_base_verbose 5 hostname
>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras) Querying component [cm]
>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
>>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras) No component selected!
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate nothing found in module - proceeding to hostfile
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate parsing default hostfile /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate nothing found in hostfiles or dash-host - checking for rankfile
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:node_insert inserting 2 nodes
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:node_insert node linpc1
>>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:node_insert node sunpc1
>>>>
>>>> ====================== ALLOCATED NODES ======================
>>>>
>>>>  Data for node: tyr.informatik.hs-fulda.de  Num slots: 0  Max slots: 0
>>>>  Data for node: linpc1                      Num slots: 1  Max slots: 0
>>>>  Data for node: sunpc1                      Num slots: 3  Max slots: 0
>>>>
>>>> =================================================================
>>>> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>>> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>>>> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:0)
>>>> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 1:1)
>>>>
>>>> Thank you very much in advance for any suggestions and any help.
>>>>
>>>> Kind regards
>>>>
>>>> Siegmar

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/