Hi

> We've been talking about this offline.  Can you send us an lstopo
> output from your Solaris machine?  Send us the text output and
> the xml output, e.g.:
> 
> lstopo > solaris.txt
> lstopo solaris.xml

I have installed hwloc-1.3.2 and hwloc-1.6.1 and get the following
output. The text output is the same for both versions, but the
XML files differ.


sunpc1 bin 121 lstopo --version
lstopo 1.3.2
sunpc1 bin 122 lstopo 
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)

sunpc1 bin 123 cd ../../hwloc-1.6.1/bin/
sunpc1 bin 124 lstopo --version
lstopo 1.6.1
sunpc1 bin 125 lstopo
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)
sunpc1 bin 126 


I have attached the requested files.

sunpc1 bin 144 lstopo --version
lstopo 1.3.2
sunpc1 bin 145 lstopo > /tmp/sunpc1-hwloc-1.3.2.txt
sunpc1 bin 146 lstopo --of xml > /tmp/sunpc1-hwloc-1.3.2.xml
sunpc1 bin 147 cd ../../hwloc-1.6.1/bin/
sunpc1 bin 148 lstopo --version
lstopo 1.6.1
sunpc1 bin 149 lstopo > /tmp/sunpc1-hwloc-1.6.1.txt
sunpc1 bin 150 lstopo --of xml > /tmp/sunpc1-hwloc-1.6.1.xml
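
For anyone who wants to inspect the XML files without lstopo, the
socket/core/PU layout can be pulled out with a few lines of Python.
This is only a rough sketch, not part of hwloc or Open MPI: the
function names are made up for illustration, SAMPLE is a trimmed
copy of one socket from the attached 1.6.1 file, and the mask
decoding assumes a single-word cpuset such as "0x0000000c".

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the same shape as the attached hwloc XML (socket 1 only).
SAMPLE = """<topology>
  <object type="Machine" os_index="0" cpuset="0x0000000f">
    <object type="NUMANode" os_index="2" cpuset="0x0000000c">
      <object type="Socket" os_index="1" cpuset="0x0000000c">
        <object type="Core" os_index="2" cpuset="0x00000004">
          <object type="PU" os_index="2" cpuset="0x00000004"/>
        </object>
        <object type="Core" os_index="3" cpuset="0x00000008">
          <object type="PU" os_index="3" cpuset="0x00000008"/>
        </object>
      </object>
    </object>
  </object>
</topology>"""


def cpuset_to_pus(mask):
    """Turn a single-word hex mask such as "0x0000000c" into PU indexes."""
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def sockets_to_cores(xml_text):
    """Map each Socket os_index to {Core os_index: [PU indexes]}."""
    root = ET.fromstring(xml_text)
    result = {}
    for socket in root.iter("object"):
        if socket.get("type") != "Socket":
            continue
        cores = {}
        # iter() also yields the socket itself, which the type check skips.
        for core in socket.iter("object"):
            if core.get("type") == "Core":
                cores[int(core.get("os_index"))] = cpuset_to_pus(core.get("cpuset"))
        result[int(socket.get("os_index"))] = cores
    return result


if __name__ == "__main__":
    print(sockets_to_cores(SAMPLE))
```

For the sample this reports cores 2 and 3 on socket 1, which matches
the Core os_index values in the attached XML.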


Thank you very much in advance for your help.


Kind regards

Siegmar




> On Feb 5, 2013, at 12:30 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
> 
> > Hi
> > 
> > Now I can use all our machines once more. I have a problem on
> > Solaris 10 x86_64, because the mapping of processes doesn't
> > correspond to the rankfile. I removed the output from "hostname"
> > and wrapped long lines.
> > 
> > tyr rankfiles 114 cat rf_ex_sunpc
> > # mpiexec -report-bindings -rf rf_ex_sunpc hostname
> > 
> > rank 0=sunpc0 slot=0:0-1,1:0-1
> > rank 1=sunpc1 slot=0:0-1
> > rank 2=sunpc1 slot=1:0
> > rank 3=sunpc1 slot=1:1
> > 
> > 
> > tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
> > [sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1]
> >  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> > [sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]:
> >  [B B][. .] (slot list 0:0-1)
> > [sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1]
> >  socket 1[core 0-1]: [B B][B B] (slot list 1:0)
> > [sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1]
> >  socket 1[core 0-1]: [B B][B B] (slot list 1:1)
> > 
> > 
> > Can I provide any information to solve this problem? My
> > rankfile works as expected, if I use only Linux machines.
> > 
> > 
> > Kind regards
> > 
> > Siegmar
> > 
> > 
> > 
> >>> Hmmm....well, it certainly works for me:
> >>> 
> >>> [rhc@odin ~/v1.6]$ cat rf
> >>> rank 0=odin093 slot=0:0-1,1:0-1
> >>> rank 1=odin094 slot=0:0-1
> >>> rank 2=odin094 slot=1:0
> >>> rank 3=odin094 slot=1:1
> >>> 
> >>> 
> >>> [rhc@odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
> >>> -mca opal_paffinity_alone 0 hostname
> >>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
> >>>  socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> >>> odin093.cs.indiana.edu
> >>> odin094.cs.indiana.edu
> >>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
> >>>  socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
> >>> odin094.cs.indiana.edu
> >>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
> >>>  socket 1[core 0]: [. .][B .] (slot list 1:0)
> >>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
> >>>  socket 1[core 1]: [. .][. B] (slot list 1:1)
> >>> odin094.cs.indiana.edu
> >> 
> >> Interesting that it works on your machines.
> >> 
> >> 
> >>> I see one thing of concern to me in your output - your second node
> >>> appears to be a Sun computer. Is it the same physical architecture?
> >>> Is it also running Linux? Are you sure it is using the same version
> >>> of OMPI, built for that environment and hardware?
> >> 
> >> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
> >> linpc1) use the same hardware. "linpc" uses openSUSE 12.1 and "sunpc"
> >> Solaris 10 x86_64. All machines use the same version of Open MPI,
> >> built for that environment. At the moment I can only use sunpc1 and
> >> linpc1 ("my" developer machines). Next week I will have access to all
> >> machines, so that I can test whether I get different behaviour when I
> >> use two machines with the same operating system (although mixed
> >> operating systems weren't a problem in the past, only machines with
> >> different endianness). I will let you know my results.
> >> 
> >> 
> >> Kind regards
> >> 
> >> Siegmar
> >> 
> >> 
> >> 
> >> 
> >>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >>> 
> >>>> Hi
> >>>> 
> >>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
> >>>> it works for my previous rankfile.
> >>>> 
> >>>> 
> >>>>> #3493: Handle the case where rankfile provides the allocation
> >>>>> -----------------------------------+----------------------------
> >>>>> Reporter:  rhc                     |      Owner:  jsquyres
> >>>>>   Type:  changeset move request  |     Status:  new
> >>>>> Priority:  critical                |  Milestone:  Open MPI 1.6.4
> >>>>> Version:  trunk                   |   Keywords:
> >>>>> -----------------------------------+----------------------------
> >>>>> Please apply the attached patch that corrects the rmaps function for
> >>>>> obtaining the available nodes when rankfile is providing the
> >>>>> allocation.
> >>>> 
> >>>> 
> >>>> tyr rankfiles 129 more rf_linpc1
> >>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
> >>>> rank 0=linpc1 slot=0:0-1,1:0-1
> >>>> 
> >>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
> >>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
> >>>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> >>>> 
> >>>> 
> >>>> 
> >>>> Unfortunately I don't get the expected result for the following
> >>>> rankfile.
> >>>> 
> >>>> tyr rankfiles 114 more rf_bsp 
> >>>> # mpiexec -report-bindings -rf rf_bsp hostname
> >>>> rank 0=linpc1 slot=0:0-1,1:0-1
> >>>> rank 1=sunpc1 slot=0:0-1
> >>>> rank 2=sunpc1 slot=1:0
> >>>> rank 3=sunpc1 slot=1:1
> >>>> 
> >>>> I would expect that rank 0 gets all four cores from linpc1, rank 1
> >>>> both cores of socket 0 from sunpc1, rank 2 core 0 of socket 1, and
> >>>> rank 3 core 1 of socket 1 from sunpc1. Everything is fine for my
> >>>> processes with rank 0 and 1, but it's wrong for ranks 2 and 3,
> >>>> because they both get all four cores of sunpc1. Is something wrong
> >>>> with my rankfile or with your mapping of processes to cores? I have
> >>>> removed the output from "hostname" and wrapped long lines.
> >>>> 
> >>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
> >>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> >>>> [B B][B B] (slot list 0:0-1,1:0-1)
> >>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
> >>>> [B B][. .] (slot list 0:0-1)
> >>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
> >>>> [B B][B B] (slot list 1:0)
> >>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
> >>>> [B B][B B] (slot list 1:1)
> >>>> 
> >>>> 
> >>>> I get the following output, if I add the options which you mentioned
> >>>> in a previous email.
> >>>> 
> >>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
> >>>> -display-allocation -mca ras_base_verbose 5 hostname
> >>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
> >>>> Querying component [cm]
> >>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
> >>>> Skipping component [cm]. Query failed to return a module
> >>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:(  ras)
> >>>> No component selected!
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>> nothing found in module - proceeding to hostfile
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>> parsing default hostfile
> >>>>  /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>> nothing found in hostfiles or dash-host - checking for rankfile
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> >>>> ras:base:node_insert inserting 2 nodes
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> >>>> ras:base:node_insert node linpc1
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> >>>> ras:base:node_insert node sunpc1
> >>>> 
> >>>> ======================   ALLOCATED NODES   ======================
> >>>> 
> >>>> Data for node: tyr.informatik.hs-fulda.de  Num slots: 0  Max slots: 0
> >>>> Data for node: linpc1  Num slots: 1    Max slots: 0
> >>>> Data for node: sunpc1  Num slots: 3    Max slots: 0
> >>>> 
> >>>> =================================================================
> >>>> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> >>>> [B B][B B] (slot list 0:0-1,1:0-1)
> >>>> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
> >>>> [B B][. .] (slot list 0:0-1)
> >>>> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
> >>>> [B B][B B] (slot list 1:0)
> >>>> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
> >>>> [B B][B B] (slot list 1:1)
> >>>> 
> >>>> 
> >>>> Thank you very much for any suggestions and any help in advance.
> >>>> 
> >>>> 
> >>>> Kind regards
> >>>> 
> >>>> Siegmar
> >>>> 
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> us...@open-mpi.org
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>> 
> >>> 
> >> 
> > 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_level="-1" os_index="0" cpuset="0x0000000f" 
complete_cpuset="0x0000000f" online_cpuset="0x0000000f" 
allowed_cpuset="0x0000000f" nodeset="0x00000006" complete_nodeset="0x00000006" 
allowed_nodeset="0x00000006">
    <info name="OSName" value="SunOS"/>
    <info name="OSRelease" value="5.10"/>
    <info name="OSVersion" value="Generic_147441-21"/>
    <info name="HostName" value="sunpc1"/>
    <info name="Architecture" value="i86pc"/>
    <object type="NUMANode" os_level="-1" os_index="1" cpuset="0x00000003" 
complete_cpuset="0x00000003" online_cpuset="0x00000003" 
allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002" local_memory="4293435392">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_level="-1" os_index="0" cpuset="0x00000003" 
complete_cpuset="0x00000003" online_cpuset="0x00000003" 
allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002">
        <object type="Core" os_level="-1" os_index="0" cpuset="0x00000001" 
complete_cpuset="0x00000001" online_cpuset="0x00000001" 
allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002">
          <object type="PU" os_level="-1" os_index="0" cpuset="0x00000001" 
complete_cpuset="0x00000001" online_cpuset="0x00000001" 
allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002"/>
        </object>
        <object type="Core" os_level="-1" os_index="1" cpuset="0x00000002" 
complete_cpuset="0x00000002" online_cpuset="0x00000002" 
allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002">
          <object type="PU" os_level="-1" os_index="1" cpuset="0x00000002" 
complete_cpuset="0x00000002" online_cpuset="0x00000002" 
allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002"/>
        </object>
      </object>
    </object>
    <object type="NUMANode" os_level="-1" os_index="2" cpuset="0x0000000c" 
complete_cpuset="0x0000000c" online_cpuset="0x0000000c" 
allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004" local_memory="4294967296">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_level="-1" os_index="1" cpuset="0x0000000c" 
complete_cpuset="0x0000000c" online_cpuset="0x0000000c" 
allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004">
        <object type="Core" os_level="-1" os_index="2" cpuset="0x00000004" 
complete_cpuset="0x00000004" online_cpuset="0x00000004" 
allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004">
          <object type="PU" os_level="-1" os_index="2" cpuset="0x00000004" 
complete_cpuset="0x00000004" online_cpuset="0x00000004" 
allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004"/>
        </object>
        <object type="Core" os_level="-1" os_index="3" cpuset="0x00000008" 
complete_cpuset="0x00000008" online_cpuset="0x00000008" 
allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004">
          <object type="PU" os_level="-1" os_index="3" cpuset="0x00000008" 
complete_cpuset="0x00000008" online_cpuset="0x00000008" 
allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004"/>
        </object>
      </object>
    </object>
  </object>
</topology>
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_index="0" cpuset="0x0000000f" 
complete_cpuset="0x0000000f" online_cpuset="0x0000000f" 
allowed_cpuset="0x0000000f" nodeset="0x00000006" complete_nodeset="0x00000006" 
allowed_nodeset="0x00000006">
    <info name="Backend" value="Solaris"/>
    <info name="OSName" value="SunOS"/>
    <info name="OSRelease" value="5.10"/>
    <info name="OSVersion" value="Generic_147441-21"/>
    <info name="HostName" value="sunpc1"/>
    <info name="Architecture" value="i86pc"/>
    <object type="NUMANode" os_index="1" cpuset="0x00000003" 
complete_cpuset="0x00000003" online_cpuset="0x00000003" 
allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002" local_memory="4293435392">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_index="0" cpuset="0x00000003" 
complete_cpuset="0x00000003" online_cpuset="0x00000003" 
allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002">
        <info name="CPUType" value=""/>
        <info name="CPUModel" value="Dual Core AMD Opteron(tm) Processor 280"/>
        <object type="Core" os_index="0" cpuset="0x00000001" 
complete_cpuset="0x00000001" online_cpuset="0x00000001" 
allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002">
          <object type="PU" os_index="0" cpuset="0x00000001" 
complete_cpuset="0x00000001" online_cpuset="0x00000001" 
allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002"/>
        </object>
        <object type="Core" os_index="1" cpuset="0x00000002" 
complete_cpuset="0x00000002" online_cpuset="0x00000002" 
allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002">
          <object type="PU" os_index="1" cpuset="0x00000002" 
complete_cpuset="0x00000002" online_cpuset="0x00000002" 
allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" 
allowed_nodeset="0x00000002"/>
        </object>
      </object>
    </object>
    <object type="NUMANode" os_index="2" cpuset="0x0000000c" 
complete_cpuset="0x0000000c" online_cpuset="0x0000000c" 
allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004" local_memory="4294967296">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_index="1" cpuset="0x0000000c" 
complete_cpuset="0x0000000c" online_cpuset="0x0000000c" 
allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004">
        <info name="CPUType" value=""/>
        <info name="CPUModel" value="Dual Core AMD Opteron(tm) Processor 280"/>
        <object type="Core" os_index="2" cpuset="0x00000004" 
complete_cpuset="0x00000004" online_cpuset="0x00000004" 
allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004">
          <object type="PU" os_index="2" cpuset="0x00000004" 
complete_cpuset="0x00000004" online_cpuset="0x00000004" 
allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004"/>
        </object>
        <object type="Core" os_index="3" cpuset="0x00000008" 
complete_cpuset="0x00000008" online_cpuset="0x00000008" 
allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004">
          <object type="PU" os_index="3" cpuset="0x00000008" 
complete_cpuset="0x00000008" online_cpuset="0x00000008" 
allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" 
allowed_nodeset="0x00000004"/>
        </object>
      </object>
    </object>
  </object>
</topology>
