Hi

Now I can use all our machines once more. I have a problem on
Solaris 10 x86_64, because the mapping of processes doesn't
correspond to the rankfile. I removed the output from "hostname"
and wrapped long lines.

tyr rankfiles 114 cat rf_ex_sunpc
# mpiexec -report-bindings -rf rf_ex_sunpc hostname

rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1


tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
[sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
[sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
[sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 1:0)
[sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 1:1)


Can I provide any information to help solve this problem? My
rankfile works as expected if I use only Linux machines.
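
In case it helps to compare outputs, here is a minimal Python sketch
(not part of Open MPI) that derives the binding mask I would expect
from a slot list and compares it with the mask printed by
-report-bindings. It assumes two sockets with two cores each, as on
these machines, and the output format shown above. The names
expected_mask and mismatches are my own and purely illustrative.

```python
import re

SOCKETS = 2           # assumption: two sockets per node, as in the output above
CORES_PER_SOCKET = 2  # assumption: two cores per socket


def expected_mask(slot_list, sockets=SOCKETS, cores=CORES_PER_SOCKET):
    """Turn a slot list such as '0:0-1,1:0' into a mask such as '[B B][B .]'."""
    bound = set()
    for part in slot_list.split(","):
        socket, core_spec = part.split(":")
        # A core spec is either a single core ('0') or a range ('0-1').
        first, _, last = core_spec.partition("-")
        for core in range(int(first), int(last or first) + 1):
            bound.add((int(socket), core))
    return "".join(
        "[" + " ".join("B" if (s, c) in bound else "." for c in range(cores)) + "]"
        for s in range(sockets)
    )


# Matches one (possibly line-wrapped) binding report and captures the
# rank, the printed mask, and the slot list.
REPORT = re.compile(
    r"MCW rank (\d+) bound to.*?((?:\[[B. ]+\])+)\s*\(slot list ([\d:,\-]+)\)",
    re.DOTALL,
)


def mismatches(report_text):
    """Return (rank, reported_mask, expected_mask) for every wrong binding."""
    wrong = []
    for rank, mask, slots in REPORT.findall(report_text):
        want = expected_mask(slots)
        if mask != want:
            wrong.append((int(rank), mask, want))
    return wrong
```

Feeding the sunpc1 lines above into mismatches() should flag ranks 2
and 3, because their reported mask covers all four cores although the
slot lists name a single core on socket 1.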


Kind regards

Siegmar



> > Hmmm....well, it certainly works for me:
> > 
> > [rhc@odin ~/v1.6]$ cat rf
> > rank 0=odin093 slot=0:0-1,1:0-1
> > rank 1=odin094 slot=0:0-1
> > rank 2=odin094 slot=1:0
> > rank 3=odin094 slot=1:1
> > 
> > 
> > [rhc@odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
> >  -mca opal_paffinity_alone 0 hostname
> > [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
> >   socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
> >   (slot list 0:0-1,1:0-1)
> > odin093.cs.indiana.edu
> > odin094.cs.indiana.edu
> > [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
> >   socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
> > odin094.cs.indiana.edu
> > [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
> >   socket 1[core 0]: [. .][B .] (slot list 1:0)
> > [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
> >   socket 1[core 1]: [. .][. B] (slot list 1:1)
> > odin094.cs.indiana.edu
> 
> Interesting that it works on your machines.
> 
> 
> > I see one thing of concern to me in your output - your second node
> > appears to be a Sun computer. Is it the same physical architecture?
> > Is it also running Linux? Are you sure it is using the same version
> > of OMPI, built for that environment and hardware?
> 
> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
> linpc1) use the same hardware. "linpc" uses openSUSE 12.1 and "sunpc"
> Solaris 10 x86_64. All machines use the same version of Open MPI,
> built for that environment. At the moment I can only use sunpc1 and
> linpc1 ("my" developer machines). Next week I will have access to all
> machines so that I can test whether I get a different behaviour when I
> use two machines with the same operating system (although mixed
> operating systems weren't a problem in the past, only machines with
> different endianness). I will let you know my results.
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
> 
> > On Jan 30, 2013, at 2:08 AM, Siegmar Gross
> > <siegmar.gr...@informatik.hs-fulda.de> wrote:
> > 
> > > Hi
> > > 
> > > I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
> > > it works for my previous rankfile.
> > > 
> > > 
> > >> #3493: Handle the case where rankfile provides the allocation
> > >> -----------------------------------+----------------------------
> > >> Reporter:  rhc                     |      Owner:  jsquyres
> > >>    Type:  changeset move request  |     Status:  new
> > >> Priority:  critical                |  Milestone:  Open MPI 1.6.4
> > >> Version:  trunk                   |   Keywords:
> > >> -----------------------------------+----------------------------
> > >> Please apply the attached patch that corrects the rmaps function for
> > > >> obtaining the available nodes when rankfile is providing the
> > > >> allocation.
> > > 
> > > 
> > > tyr rankfiles 129 more rf_linpc1
> > > # mpiexec -report-bindings -rf rf_linpc1 hostname
> > > rank 0=linpc1 slot=0:0-1,1:0-1
> > > 
> > > tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
> > > [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
> > >  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> > > 
> > > 
> > > 
> > > Unfortunately I don't get the expected result for the following
> > > rankfile.
> > > 
> > > tyr rankfiles 114 more rf_bsp 
> > > # mpiexec -report-bindings -rf rf_bsp hostname
> > > rank 0=linpc1 slot=0:0-1,1:0-1
> > > rank 1=sunpc1 slot=0:0-1
> > > rank 2=sunpc1 slot=1:0
> > > rank 3=sunpc1 slot=1:1
> > > 
> > > I would expect that rank 0 gets all four cores from linpc1, rank 1
> > > both cores of socket 0 from sunpc1, rank 2 core 0 of socket 1, and
> > > rank 3 core 1 of socket 1 from sunpc1. Everything is fine for my
> > > processes with rank 0 and 1, but it's wrong for ranks 2 and 3,
> > > because they both get all four cores of sunpc1. Is something wrong
> > > with my rankfile or with your mapping of processes to cores? I have
> > > removed the output from "hostname" and wrapped long lines.
> > > 
> > > tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
> > > [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >  [B B][B B] (slot list 0:0-1,1:0-1)
> > > [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
> > >  [B B][. .] (slot list 0:0-1)
> > > [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >  [B B][B B] (slot list 1:0)
> > > [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >  [B B][B B] (slot list 1:1)
> > > 
> > > 
> > > I get the following output if I add the options that you mentioned
> > > in a previous email.
> > > 
> > > tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
> > >  -display-allocation -mca ras_base_verbose 5 hostname
> > > [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
> > >  Querying component [cm]
> > > [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
> > >  Skipping component [cm]. Query failed to return a module
> > > [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
> > >  No component selected!
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > >  nothing found in module - proceeding to hostfile
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > >  parsing default hostfile
> > >   /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> > >  nothing found in hostfiles or dash-host - checking for rankfile
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> > >  ras:base:node_insert inserting 2 nodes
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> > >  ras:base:node_insert node linpc1
> > > [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> > >  ras:base:node_insert node sunpc1
> > > 
> > > ======================   ALLOCATED NODES   ======================
> > > 
> > > Data for node: tyr.informatik.hs-fulda.de  Num slots: 0  Max slots: 0
> > > Data for node: linpc1  Num slots: 1    Max slots: 0
> > > Data for node: sunpc1  Num slots: 3    Max slots: 0
> > > 
> > > =================================================================
> > > [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >  [B B][B B] (slot list 0:0-1,1:0-1)
> > > [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
> > >  [B B][. .] (slot list 0:0-1)
> > > [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >  [B B][B B] (slot list 1:0)
> > > [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
> > >  [B B][B B] (slot list 1:1)
> > > 
> > > 
> > > Thank you very much for any suggestions and any help in advance.
> > > 
> > > 
> > > Kind regards
> > > 
> > > Siegmar
> > > 
> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > 
> > 
> 
