Hi

> We've been talking about this offline. Can you send us an lstopo
> output from your Solaris machine? Send us the text output and
> the xml output, e.g.:
>
>   lstopo > solaris.txt
>   lstopo solaris.xml
I have installed hwloc-1.3.2 and hwloc-1.6.1 and get the following
output (it is the same for both versions in the text file, but the
xml files differ).

sunpc1 bin 121 lstopo --version
lstopo 1.3.2
sunpc1 bin 122 lstopo
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)
sunpc1 bin 123 cd ../../hwloc-1.6.1/bin/
sunpc1 bin 124 lstopo --version
lstopo 1.6.1
sunpc1 bin 125 lstopo
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)
sunpc1 bin 126

I have attached the requested files.

sunpc1 bin 144 lstopo --version
lstopo 1.3.2
sunpc1 bin 145 lstopo > /tmp/sunpc1-hwloc-1.3.2.txt
sunpc1 bin 146 lstopo --of xml > /tmp/sunpc1-hwloc-1.3.2.xml
sunpc1 bin 147 cd ../../hwloc-1.6.1/bin/
sunpc1 bin 148 lstopo --version
lstopo 1.6.1
sunpc1 bin 149 lstopo > /tmp/sunpc1-hwloc-1.6.1.txt
sunpc1 bin 150 lstopo --of xml > /tmp/sunpc1-hwloc-1.6.1.xml

Thank you very much in advance for your help.

Kind regards

Siegmar

> On Feb 5, 2013, at 12:30 AM, Siegmar Gross
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> > Hi
> >
> > now I can use all our machines once more. I have a problem on
> > Solaris 10 x86_64, because the mapping of processes doesn't
> > correspond to the rankfile. I removed the output from "hostname"
> > and wrapped long lines.
> >
> > tyr rankfiles 114 cat rf_ex_sunpc
> > # mpiexec -report-bindings -rf rf_ex_sunpc hostname
> >
> > rank 0=sunpc0 slot=0:0-1,1:0-1
> > rank 1=sunpc1 slot=0:0-1
> > rank 2=sunpc1 slot=1:0
> > rank 3=sunpc1 slot=1:1
> >
> > tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
> > [sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1]
> >   socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> > [sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]:
> >   [B B][. .] (slot list 0:0-1)
> > [sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1]
> >   socket 1[core 0-1]: [B B][B B] (slot list 1:0)
> > [sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1]
> >   socket 1[core 0-1]: [B B][B B] (slot list 1:1)
> >
> > Can I provide any information to solve this problem? My
> > rankfile works as expected if I use only Linux machines.
> >
> > Kind regards
> >
> > Siegmar
> >
> >
> >>> Hmmm... well, it certainly works for me:
> >>>
> >>> [rhc@odin ~/v1.6]$ cat rf
> >>> rank 0=odin093 slot=0:0-1,1:0-1
> >>> rank 1=odin094 slot=0:0-1
> >>> rank 2=odin094 slot=1:0
> >>> rank 3=odin094 slot=1:1
> >>>
> >>> [rhc@odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings \
> >>>   -mca opal_paffinity_alone 0 hostname
> >>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
> >>>   socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> >>> odin093.cs.indiana.edu
> >>> odin094.cs.indiana.edu
> >>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
> >>>   socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
> >>> odin094.cs.indiana.edu
> >>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
> >>>   socket 1[core 0]: [. .][B .] (slot list 1:0)
> >>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
> >>>   socket 1[core 1]: [. .][. B] (slot list 1:1)
> >>> odin094.cs.indiana.edu
> >>
> >> Interesting that it works on your machines.
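[Editor's illustration: the slot-list syntax in the rankfiles above
("socket:core", with optional core ranges and comma-separated groups)
can be sketched as a small parser. This is a hypothetical helper for
illustration only, not Open MPI's actual parsing code.]

```python
def parse_slot_list(spec):
    """Parse a rankfile slot list such as '0:0-1,1:0-1' into
    a list of (socket, core) pairs.  Hypothetical helper, not
    Open MPI's real parser."""
    pairs = []
    for part in spec.split(","):
        socket, cores = part.split(":")
        if "-" in cores:  # a core range like 0-1
            lo, hi = map(int, cores.split("-"))
            core_range = range(lo, hi + 1)
        else:             # a single core like 0
            core_range = [int(cores)]
        pairs.extend((int(socket), c) for c in core_range)
    return pairs

# Rank 0's slot list from the rankfile above:
print(parse_slot_list("0:0-1,1:0-1"))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
# Rank 2's slot list:
print(parse_slot_list("1:0"))          # [(1, 0)]
```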
> >>
> >>
> >>> I see one thing of concern to me in your output - your second node
> >>> appears to be a Sun computer. Is it the same physical architecture?
> >>> Is it also running Linux? Are you sure it is using the same version
> >>> of OMPI, built for that environment and hardware?
> >>
> >> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
> >> linpc1) use the same hardware. "linpc" runs openSUSE 12.1 and "sunpc"
> >> Solaris 10 x86_64. All machines use the same version of Open MPI,
> >> built for that environment. At the moment I can only use sunpc1 and
> >> linpc1 ("my" developer machines). Next week I will have access to all
> >> machines, so I can test whether I get a different behaviour when I
> >> use two machines with the same operating system (although mixed
> >> operating systems weren't a problem in the past, only machines with
> >> different endians were). I will let you know my results.
> >>
> >>
> >> Kind regards
> >>
> >> Siegmar
> >>
> >>
> >>
> >>
> >>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross
> >>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >>>
> >>>> Hi
> >>>>
> >>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
> >>>> it works for my previous rankfile.
> >>>>
> >>>>
> >>>>> #3493: Handle the case where rankfile provides the allocation
> >>>>> -----------------------------------+----------------------------
> >>>>>  Reporter: rhc                     | Owner: jsquyres
> >>>>>      Type: changeset move request  | Status: new
> >>>>>  Priority: critical                | Milestone: Open MPI 1.6.4
> >>>>>   Version: trunk                   | Keywords:
> >>>>> -----------------------------------+----------------------------
> >>>>> Please apply the attached patch that corrects the rmaps function for
> >>>>> obtaining the available nodes when rankfile is providing the
> >>>>> allocation.
> >>>>
> >>>>
> >>>> tyr rankfiles 129 more rf_linpc1
> >>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
> >>>> rank 0=linpc1 slot=0:0-1,1:0-1
> >>>>
> >>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
> >>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
> >>>>   socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> >>>>
> >>>>
> >>>> Unfortunately I don't get the expected result for the following
> >>>> rankfile.
> >>>>
> >>>> tyr rankfiles 114 more rf_bsp
> >>>> # mpiexec -report-bindings -rf rf_bsp hostname
> >>>> rank 0=linpc1 slot=0:0-1,1:0-1
> >>>> rank 1=sunpc1 slot=0:0-1
> >>>> rank 2=sunpc1 slot=1:0
> >>>> rank 3=sunpc1 slot=1:1
> >>>>
> >>>> I would expect rank 0 to get all four cores of linpc1, rank 1
> >>>> both cores of socket 0 on sunpc1, rank 2 core 0 of socket 1, and
> >>>> rank 3 core 1 of socket 1 on sunpc1. Everything is fine for
> >>>> ranks 0 and 1, but it is wrong for ranks 2 and 3, because they
> >>>> both get all four cores of sunpc1. Is something wrong with my
> >>>> rankfile or with your mapping of processes to cores? I have
> >>>> removed the output from "hostname" and wrapped long lines.
> >>>>
> >>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
> >>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1]
> >>>>   socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> >>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
> >>>>   [B B][. .] (slot list 0:0-1)
> >>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1]
> >>>>   socket 1[core 0-1]: [B B][B B] (slot list 1:0)
> >>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1]
> >>>>   socket 1[core 0-1]: [B B][B B] (slot list 1:1)
> >>>>
> >>>>
> >>>> I get the following output if I add the options you mentioned
> >>>> in a previous email.
> >>>>
> >>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
> >>>>   -display-allocation -mca ras_base_verbose 5 hostname
> >>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> >>>>   Querying component [cm]
> >>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> >>>>   Skipping component [cm]. Query failed to return a module
> >>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> >>>>   No component selected!
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>>   nothing found in module - proceeding to hostfile
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>>   parsing default hostfile
> >>>>   /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>>   nothing found in hostfiles or dash-host - checking for rankfile
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> >>>>   ras:base:node_insert inserting 2 nodes
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> >>>>   ras:base:node_insert node linpc1
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> >>>>   ras:base:node_insert node sunpc1
> >>>>
> >>>> ======================   ALLOCATED NODES   ======================
> >>>>
> >>>>  Data for node: tyr.informatik.hs-fulda.de  Num slots: 0  Max slots: 0
> >>>>  Data for node: linpc1                      Num slots: 1  Max slots: 0
> >>>>  Data for node: sunpc1                      Num slots: 3  Max slots: 0
> >>>>
> >>>> =================================================================
> >>>> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1]
> >>>>   socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> >>>> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
> >>>>   [B B][. .] (slot list 0:0-1)
> >>>> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1]
> >>>>   socket 1[core 0-1]: [B B][B B] (slot list 1:0)
> >>>> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1]
> >>>>   socket 1[core 0-1]: [B B][B B] (slot list 1:1)
> >>>>
> >>>>
> >>>> Thank you very much in advance for any suggestions and help.
> >>>>
> >>>>
> >>>> Kind regards
> >>>>
> >>>> Siegmar
> >>>>
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> us...@open-mpi.org
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>
> >>>
> >>
> >
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
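[Editor's illustration: the bindings Siegmar expected for ranks 2 and 3
can be rendered the way -report-bindings prints them. A minimal sketch,
assuming the two-socket, two-cores-per-socket layout of these machines;
render_binding is a made-up name, not part of Open MPI.]

```python
def render_binding(bound, sockets=2, cores=2):
    """Render a set of (socket, core) pairs the way Open MPI's
    -report-bindings prints a binding map, e.g. '[B B][. .]'.
    Made-up helper for illustration."""
    return "".join(
        "[" + " ".join("B" if (s, c) in bound else "."
                       for c in range(cores)) + "]"
        for s in range(sockets)
    )

# Rank 2's slot list 1:0 should bind only core 0 of socket 1:
print(render_binding({(1, 0)}))          # [. .][B .]
# ...but the Solaris output above shows everything bound instead:
print(render_binding({(0, 0), (0, 1), (1, 0), (1, 1)}))  # [B B][B B]
```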
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_level="-1" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" online_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000006" complete_nodeset="0x00000006" allowed_nodeset="0x00000006">
    <info name="OSName" value="SunOS"/>
    <info name="OSRelease" value="5.10"/>
    <info name="OSVersion" value="Generic_147441-21"/>
    <info name="HostName" value="sunpc1"/>
    <info name="Architecture" value="i86pc"/>
    <object type="NUMANode" os_level="-1" os_index="1" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002" local_memory="4293435392">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_level="-1" os_index="0" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
        <object type="Core" os_level="-1" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
          <object type="PU" os_level="-1" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
        </object>
        <object type="Core" os_level="-1" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
          <object type="PU" os_level="-1" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
        </object>
      </object>
    </object>
    <object type="NUMANode" os_level="-1" os_index="2" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004" local_memory="4294967296">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_level="-1" os_index="1" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
        <object type="Core" os_level="-1" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
          <object type="PU" os_level="-1" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
        </object>
        <object type="Core" os_level="-1" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
          <object type="PU" os_level="-1" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
        </object>
      </object>
    </object>
  </object>
</topology>
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" online_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000006" complete_nodeset="0x00000006" allowed_nodeset="0x00000006">
    <info name="Backend" value="Solaris"/>
    <info name="OSName" value="SunOS"/>
    <info name="OSRelease" value="5.10"/>
    <info name="OSVersion" value="Generic_147441-21"/>
    <info name="HostName" value="sunpc1"/>
    <info name="Architecture" value="i86pc"/>
    <object type="NUMANode" os_index="1" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002" local_memory="4293435392">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_index="0" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
        <info name="CPUType" value=""/>
        <info name="CPUModel" value="Dual Core AMD Opteron(tm) Processor 280"/>
        <object type="Core" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
          <object type="PU" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
        </object>
        <object type="Core" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
          <object type="PU" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
        </object>
      </object>
    </object>
    <object type="NUMANode" os_index="2" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004" local_memory="4294967296">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_index="1" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
        <info name="CPUType" value=""/>
        <info name="CPUModel" value="Dual Core AMD Opteron(tm) Processor 280"/>
        <object type="Core" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
          <object type="PU" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
        </object>
        <object type="Core" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
          <object type="PU" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
        </object>
      </object>
    </object>
  </object>
</topology>
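[Editor's illustration: in the attached topologies, each object's cpuset
is the union of its children's cpusets. A minimal consistency check in
Python, using an inline excerpt (socket 1 of sunpc1) rather than the
full attachment.]

```python
import xml.etree.ElementTree as ET

# Excerpt of the attached sunpc1 topology: socket 1 with cores 2 and 3
# (only the cpuset attributes are kept for brevity).
SOCKET_XML = """
<object type="Socket" os_index="1" cpuset="0x0000000c">
  <object type="Core" os_index="2" cpuset="0x00000004">
    <object type="PU" os_index="2" cpuset="0x00000004"/>
  </object>
  <object type="Core" os_index="3" cpuset="0x00000008">
    <object type="PU" os_index="3" cpuset="0x00000008"/>
  </object>
</object>
"""

socket = ET.fromstring(SOCKET_XML)

# OR together the cpusets of the socket's cores.
core_mask = 0
for core in socket.findall("./object[@type='Core']"):
    core_mask |= int(core.get("cpuset"), 16)

# The socket's cpuset must be the union of its cores' cpusets.
assert core_mask == int(socket.get("cpuset"), 16)
print(hex(core_mask))  # 0xc
```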