On Jan 31, 2013, at 12:39 PM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
> Hi
> 
>> Hmmm....well, it certainly works for me:
>> 
>> [rhc@odin ~/v1.6]$ cat rf
>> rank 0=odin093 slot=0:0-1,1:0-1
>> rank 1=odin094 slot=0:0-1
>> rank 2=odin094 slot=1:0
>> rank 3=odin094 slot=1:1
>> 
>> 
>> [rhc@odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
>> -mca opal_paffinity_alone 0 hostname
>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
>> socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>> odin093.cs.indiana.edu
>> odin094.cs.indiana.edu
>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
>> socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
>> odin094.cs.indiana.edu
>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
>> socket 1[core 0]: [. .][B .] (slot list 1:0)
>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
>> socket 1[core 1]: [. .][. B] (slot list 1:1)
>> odin094.cs.indiana.edu
> 
> Interesting that it works on your machines.
> 
> 
>> I see one thing of concern to me in your output - your second node
>> appears to be a Sun computer. Is it the same physical architecture?
>> Is it also running Linux? Are you sure it is using the same version
>> of OMPI, built for that environment and hardware?
> 
> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
> linpc1) use the same hardware. "linpc" runs openSUSE 12.1 and "sunpc"
> runs Solaris 10 x86_64. All machines use the same version of Open MPI,
> built for that environment. At the moment I can only use sunpc1 and
> linpc1 ("my" developer machines). Next week I will have access to all
> machines, so I can test whether I get different behaviour when I use
> two machines with the same operating system (mixed operating systems
> weren't a problem in the past, only machines with different
> endianness). I will let you know my results.

I suspect the problem is Solaris being on the remote machine. I don't
know how far our Solaris support may have rotted by now.

> 
> 
> Kind regards
> 
> Siegmar
> 
> 
>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross
>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>> 
>>> Hi
>>> 
>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
>>> it works for my previous rankfile.
>>> 
>>> 
>>>> #3493: Handle the case where rankfile provides the allocation
>>>> -----------------------------------+----------------------------
>>>>  Reporter:  rhc                    |      Owner:  jsquyres
>>>>      Type:  changeset move request |     Status:  new
>>>>  Priority:  critical               |  Milestone:  Open MPI 1.6.4
>>>>   Version:  trunk                  |   Keywords:
>>>> -----------------------------------+----------------------------
>>>> Please apply the attached patch that corrects the rmaps function for
>>>> obtaining the available nodes when rankfile is providing the allocation.
>>> 
>>> 
>>> tyr rankfiles 129 more rf_linpc1
>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>> 
>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
>>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
>>> 
>>> 
>>> Unfortunately I don't get the expected result for the following
>>> rankfile.
>>> 
>>> tyr rankfiles 114 more rf_bsp
>>> # mpiexec -report-bindings -rf rf_bsp hostname
>>> rank 0=linpc1 slot=0:0-1,1:0-1
>>> rank 1=sunpc1 slot=0:0-1
>>> rank 2=sunpc1 slot=1:0
>>> rank 3=sunpc1 slot=1:1
>>> 
>>> I would expect that rank 0 gets all four cores from linpc1, rank 1
>>> both cores of socket 0 from sunpc1, rank 2 core 0 of socket 1, and
>>> rank 3 core 1 of socket 1 from sunpc1.
>>> Everything is fine for my processes with rank 0 and 1, but it's
>>> wrong for ranks 2 and 3, because they both get all four cores of
>>> sunpc1. Is something wrong with my rankfile or with your mapping of
>>> processes to cores? I have removed the output from "hostname" and
>>> wrapped long lines.
>>> 
>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
>>> [B B][. .] (slot list 0:0-1)
>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:0)
>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:1)
>>> 
>>> 
>>> I get the following output if I add the options which you mentioned
>>> in a previous email.
>>> 
>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
>>>   -display-allocation -mca ras_base_verbose 5 hostname
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> Querying component [cm]
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> Skipping component [cm]. Query failed to return a module
>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
>>> No component selected!
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> nothing found in module - proceeding to hostfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> parsing default hostfile
>>> /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
>>> nothing found in hostfiles or dash-host - checking for rankfile
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert inserting 2 nodes
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert node linpc1
>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
>>> ras:base:node_insert node sunpc1
>>> 
>>> ====================== ALLOCATED NODES ======================
>>> 
>>>  Data for node: tyr.informatik.hs-fulda.de  Num slots: 0  Max slots: 0
>>>  Data for node: linpc1                      Num slots: 1  Max slots: 0
>>>  Data for node: sunpc1                      Num slots: 3  Max slots: 0
>>> 
>>> =================================================================
>>> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 0:0-1,1:0-1)
>>> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
>>> [B B][. .] (slot list 0:0-1)
>>> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:0)
>>> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
>>> [B B][B B] (slot list 1:1)
>>> 
>>> 
>>> Thank you very much for any suggestions and any help in advance.
>>> 
>>> 
>>> Kind regards
>>> 
>>> Siegmar
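
One quick way to narrow this down would be a Solaris-only run, so that
sunpc1 is the only node named in the rankfile. This is only a sketch:
rf_sun is a hypothetical file name, and the socket/core numbers assume
the same two-socket, two-cores-per-socket layout that your
-report-bindings output shows.

$ cat rf_sun
# one rank per socket, sunpc1 only
rank 0=sunpc1 slot=0:0-1
rank 1=sunpc1 slot=1:0-1

$ mpiexec -report-bindings -rf rf_sun -mca opal_paffinity_alone 0 hostname

If rank 0 comes back as something like [B B][. .] and rank 1 as
[. .][B B], then the slot-list handling on Solaris itself looks fine and
the problem is more likely in how the mapper treats sunpc1 when it is
the second node of a multi-host rankfile; if both ranks again get all
four cores, that points at the Solaris binding support.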