Hi,

now I can use all our machines once more. I have a problem on Solaris 10
x86_64, because the mapping of processes doesn't correspond to the
rankfile. I removed the output from "hostname" and wrapped long lines.
tyr rankfiles 114 cat rf_ex_sunpc
# mpiexec -report-bindings -rf rf_ex_sunpc hostname
rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1

tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
[sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 0:0-1,1:0-1)
[sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
[sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 1:0)
[sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 1:1)

Can I provide any information to solve this problem? My rankfile works
as expected if I use only Linux machines. Two additional checks that I
could run myself are sketched at the end of this mail.

Kind regards

Siegmar


> > Hmmm....well, it certainly works for me:
> >
> > [rhc@odin ~/v1.6]$ cat rf
> > rank 0=odin093 slot=0:0-1,1:0-1
> > rank 1=odin094 slot=0:0-1
> > rank 2=odin094 slot=1:0
> > rank 3=odin094 slot=1:1
> >
> > [rhc@odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
> >     -mca opal_paffinity_alone 0 hostname
> > [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
> >   socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> > odin093.cs.indiana.edu
> > odin094.cs.indiana.edu
> > [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
> >   socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
> > odin094.cs.indiana.edu
> > [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
> >   socket 1[core 0]: [. .][B .] (slot list 1:0)
> > [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
> >   socket 1[core 1]: [. .][. B] (slot list 1:1)
> > odin094.cs.indiana.edu
>
> Interesting that it works on your machines.
>
> > I see one thing of concern to me in your output - your second node
> > appears to be a Sun computer. Is it the same physical architecture?
> > Is it also running Linux? Are you sure it is using the same version
> > of OMPI, built for that environment and hardware?
>
> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
> linpc1) use the same hardware. "linpc" uses openSUSE 12.1 and "sunpc"
> Solaris 10 x86_64. All machines use the same version of Open MPI,
> built for that environment. At the moment I can only use sunpc1 and
> linpc1 ("my" developer machines). Next week I will have access to all
> machines so that I can test whether I get a different behaviour when I
> use two machines with the same operating system (although mixed
> operating systems weren't a problem in the past, only machines with
> different endianness). I will let you know my results.
>
> Kind regards
>
> Siegmar
>
> > On Jan 30, 2013, at 2:08 AM, Siegmar Gross
> > <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >
> > > Hi
> > >
> > > I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
> > > it works for my previous rankfile.
> > >
> > >> #3493: Handle the case where rankfile provides the allocation
> > >> -----------------------------------+----------------------------
> > >>  Reporter:  rhc                    |      Owner:  jsquyres
> > >>      Type:  changeset move request |     Status:  new
> > >>  Priority:  critical               |  Milestone:  Open MPI 1.6.4
> > >>   Version:  trunk                  |   Keywords:
> > >> -----------------------------------+----------------------------
> > >> Please apply the attached patch that corrects the rmaps function for
> > >> obtaining the available nodes when rankfile is providing the
> > >> allocation.
> > >
> > > tyr rankfiles 129 more rf_linpc1
> > > # mpiexec -report-bindings -rf rf_linpc1 hostname
> > > rank 0=linpc1 slot=0:0-1,1:0-1
> > >
> > > tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
> > > [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
> > >   socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> > >
> > > Unfortunately I don't get the expected result for the following
> > > rankfile.
> > >
> > > tyr rankfiles 114 more rf_bsp
> > > # mpiexec -report-bindings -rf rf_bsp hostname
> > > rank 0=linpc1 slot=0:0-1,1:0-1
> > > rank 1=sunpc1 slot=0:0-1
> > > rank 2=sunpc1 slot=1:0
> > > rank 3=sunpc1 slot=1:1
> > >
> > > I would expect that rank 0 gets all four cores from linpc1, rank 1
> > > both cores of socket 0 from sunpc1, rank 2 core 0 of socket 1, and
> > > rank 3 core 1 of socket 1 from sunpc1. Everything is fine for my
> > > processes with rank 0 and 1, but it's wrong for ranks 2 and 3,
> > > because they both get all four cores of sunpc1. Is something wrong
> > > with my rankfile or with your mapping of processes to cores? I have
> > > removed the output from "hostname" and wrapped long lines.
> > >
> > > tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
> > > [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >   [B B][B B] (slot list 0:0-1,1:0-1)
> > > [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
> > >   [B B][. .] (slot list 0:0-1)
> > > [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >   [B B][B B] (slot list 1:0)
> > > [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >   [B B][B B] (slot list 1:1)
> > >
> > > I get the following output, if I add the options which you mentioned
> > > in a previous email.
> > >
> > > tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
> > >   -display-allocation -mca ras_base_verbose 5 hostname
> > > [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> > >   Querying component [cm]
> > > [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> > >   Skipping component [cm]. Query failed to return a module
> > > [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> > >   No component selected!
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > >   nothing found in module - proceeding to hostfile
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > >   parsing default hostfile
> > >   /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > >   nothing found in hostfiles or dash-host - checking for rankfile
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> > >   ras:base:node_insert inserting 2 nodes
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> > >   ras:base:node_insert node linpc1
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> > >   ras:base:node_insert node sunpc1
> > >
> > > ======================   ALLOCATED NODES   ======================
> > >
> > >  Data for node: tyr.informatik.hs-fulda.de  Num slots: 0  Max slots: 0
> > >  Data for node: linpc1                      Num slots: 1  Max slots: 0
> > >  Data for node: sunpc1                      Num slots: 3  Max slots: 0
> > >
> > > =================================================================
> > > [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >   [B B][B B] (slot list 0:0-1,1:0-1)
> > > [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
> > >   [B B][. .] (slot list 0:0-1)
> > > [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >   [B B][B B] (slot list 1:0)
> > > [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >   [B B][B B] (slot list 1:1)
> > >
> > > Thank you very much for any suggestions and any help in advance.
> > >
> > > Kind regards
> > >
> > > Siegmar
> > >
> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
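
P.S. To cross-check whether the binding that -report-bindings prints is
really the binding that gets applied on Solaris, every rank could query
its own binding through the hwloc API (Open MPI uses hwloc internally;
an external libhwloc and mpicc are assumed here, and "bindcheck.c" is
just my name for the file). A rough sketch, not yet tested on Solaris:

/* bindcheck.c: every rank prints the cpuset it is actually bound to */
#include <stdio.h>
#include <mpi.h>
#include <hwloc.h>

int main(int argc, char **argv)
{
  int rank, len;
  char host[MPI_MAX_PROCESSOR_NAME], buf[128];
  hwloc_topology_t topo;
  hwloc_bitmap_t set;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(host, &len);

  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);
  set = hwloc_bitmap_alloc();

  /* ask the operating system for the current binding of this process */
  hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS);
  hwloc_bitmap_snprintf(buf, sizeof(buf), set);
  printf("rank %d on %s bound to cpuset %s\n", rank, host, buf);

  hwloc_bitmap_free(set);
  hwloc_topology_destroy(topo);
  MPI_Finalize();
  return 0;
}

Compiled with something like "mpicc bindcheck.c -o bindcheck -lhwloc"
and started with "mpiexec -rf rf_ex_sunpc ./bindcheck", ranks 2 and 3
should print different cpusets if the binding is really applied as
requested in the rankfile.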
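
P.P.S. In the -display-allocation output above the allocation comes
from the rankfile itself ("nothing found in hostfiles or dash-host -
checking for rankfile"), and sunpc1 is listed with only 3 slots. I
could also test whether the Solaris binding changes when the allocation
is supplied explicitly through a hostfile and the rankfile is only used
for the mapping. This is just a guess; the file name and the slot
counts below are only an example for our dual-socket, dual-core
machines:

# my_hosts
linpc1 slots=4
sunpc1 slots=4

mpiexec -report-bindings -hostfile my_hosts -rf rf_bsp hostname

I don't know whether mpiexec accepts both options together in 1.6.4,
but if it does, it would separate the allocation question from the
mapping and binding question.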