We actually include hwloc v1.3.2 in the OMPI v1.6 series.

Can you download and try that on your machines?  

      http://www.open-mpi.org/software/hwloc/v1.3/
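
Building the standalone tarball is the usual configure / make / make install 
sequence; something like the following should be enough (the version number and 
the install prefix are only examples; adjust them to whatever you download):

    gunzip -c hwloc-1.3.2.tar.gz | tar xf -
    cd hwloc-1.3.2
    ./configure --prefix=$HOME/hwloc
    make
    make install
    export PATH=$HOME/hwloc/bin:$PATH
    lstopo    # quick sanity check: does the reported topology look right?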

In particular, try the hwloc-bind executable (outside of OMPI) and see whether 
binding works properly on your machines.  I typically run a test script when 
I'm testing binding:

------
[12:59] svbu-mpi059:~/mpi % lstopo --no-io
Machine (64GB)
  NUMANode L#0 (P#0 32GB) + Socket L#0 + L3 L#0 (20MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#16)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#17)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#18)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#19)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
      PU L#8 (P#4)
      PU L#9 (P#20)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
      PU L#10 (P#5)
      PU L#11 (P#21)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
      PU L#12 (P#6)
      PU L#13 (P#22)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#23)
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#24)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#25)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#26)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#27)
    L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12
      PU L#24 (P#12)
      PU L#25 (P#28)
    L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13
      PU L#26 (P#13)
      PU L#27 (P#29)
    L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14
      PU L#28 (P#14)
      PU L#29 (P#30)
    L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15
      PU L#30 (P#15)
      PU L#31 (P#31)
[12:59] svbu-mpi059:~/mpi % hwloc-bind socket:1.core:5 -l ./report-bindings.sh
MCW rank  (svbu-mpi059): Socket:1.Core:5.PU:13 Socket:1.Core:5.PU:29
[13:00] svbu-mpi059:~/mpi % cat report-bindings.sh
#!/bin/sh

# Get our current binding as a physical-index bitmap, then translate it
# into a human-readable socket.core.pu location.
bitmap=`hwloc-bind --get -p`
friendly=`hwloc-calc -p -H socket.core.pu $bitmap`

echo "MCW rank $OMPI_COMM_WORLD_RANK (`hostname`): $friendly"
exit 0
[13:00] svbu-mpi059:~/mpi % 
------

Try running hwloc-bind to bind yourself to some logical location, run my 
report-bindings.sh script, and check whether the physical indexes it 
reports are correct.
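
For example (the socket/core numbers below are only placeholders; pick a pair 
that exists in your own lstopo output, and report-bindings.sh is the script 
shown above):

    hwloc-bind socket:0.core:1 -l ./report-bindings.sh

The script prints the physical socket/core/PU it actually got bound to, which 
should match what "lstopo -p" reports for the logical location you asked for.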



On Sep 10, 2012, at 7:34 AM, Siegmar Gross wrote:

> Hi,
> 
>>> are the following outputs helpful to find the error with
>>> a rankfile on Solaris?
>> 
>> If you can't bind on the new Solaris machine, then the rankfile
>> won't do you any good. It looks like we are getting the incorrect
>> number of cores on that machine - is it possible that it has
>> hardware threads, and doesn't report "cores"? Can you download
>> and run a copy of lstopo to check the output? You get that from
>> the hwloc folks:
>> 
>> http://www.open-mpi.org/software/hwloc/v1.5/
> 
> I downloaded and installed the package on our machines. Perhaps it is
> easier to detect the error if you have more information. Therefore I
> provide the different hardware architectures of all machines on which
> a simple program breaks if I try to bind processes to sockets or cores.
> 
> I tried the following five commands, with "h" being one of "tyr", "rs0",
> "linpc0", "linpc1", "linpc2", "linpc4", "sunpc0", "sunpc1",
> "sunpc2", or "sunpc4", from a shell script that I started on
> my local machine ("tyr"). "works on" means that the small program
> (MPI_Init, printf, MPI_Finalize) didn't break. I didn't check whether
> the layout of the processes was correct.
> 
> 
> mpiexec -report-bindings -np 4 -host h init_finalize
> 
> works on:  tyr, rs0, linpc0, linpc1, linpc2, linpc4, sunpc0, sunpc1,
>           sunpc2, sunpc4
> breaks on: -
> 
> 
> mpiexec -report-bindings -np 4 -host h -bind-to-core -bycore init_finalize
> 
> works on:  linpc2, sunpc1
> breaks on: tyr, rs0, linpc0, linpc1, linpc4, sunpc0, sunpc2, sunpc4
> 
> 
> mpiexec -report-bindings -np 4 -host h -bind-to-core -bysocket init_finalize
> 
> works on:  linpc2, sunpc1
> breaks on: tyr, rs0, linpc0, linpc1, linpc4, sunpc0, sunpc2, sunpc4
> 
> 
> mpiexec -report-bindings -np 4 -host h -bind-to-socket -bycore init_finalize
> 
> works on:  tyr, linpc1, linpc2, sunpc1, sunpc2
> breaks on: rs0, linpc0, linpc4, sunpc0, sunpc4
> 
> 
> mpiexec -report-bindings -np 4 -host h -bind-to-socket -bysocket init_finalize
> 
> works on:  tyr, linpc1, linpc2, sunpc1, sunpc2
> breaks on: rs0, linpc0, linpc4, sunpc0, sunpc4
> 
> 
> 
> "lstopo" shows the following hardware configurations for the above
> machines. The first line always shows the installed architecture.
> "lstopo" does a good job as far as I can tell.
> 
> tyr:
> ----
> 
> UltraSPARC-IIIi, 2 single core processors, no hardware threads
> 
> tyr fd1026 183 lstopo
> Machine (4096MB)
>  NUMANode L#0 (P#2 2048MB) + Socket L#0 + Core L#0 + PU L#0 (P#0)
>  NUMANode L#1 (P#1 2048MB) + Socket L#1 + Core L#1 + PU L#1 (P#1)
> 
> tyr fd1026 116 psrinfo -pv
> The physical processor has 1 virtual processor (0)
>  UltraSPARC-IIIi (portid 0 impl 0x16 ver 0x34 clock 1600 MHz)
> The physical processor has 1 virtual processor (1)
>  UltraSPARC-IIIi (portid 1 impl 0x16 ver 0x34 clock 1600 MHz)
> 
> 
> rs0, rs1:
> ---------
> 
> SPARC64-VII, 2 quad-core processors, 2 hardware threads / core
> 
> rs0 fd1026 105 lstopo
> Machine (32GB) + NUMANode L#0 (P#1 32GB)
>  Socket L#0
>    Core L#0
>      PU L#0 (P#0)
>      PU L#1 (P#1)
>    Core L#1
>      PU L#2 (P#2)
>      PU L#3 (P#3)
>    Core L#2
>      PU L#4 (P#4)
>      PU L#5 (P#5)
>    Core L#3
>      PU L#6 (P#6)
>      PU L#7 (P#7)
>  Socket L#1
>    Core L#4
>      PU L#8 (P#8)
>      PU L#9 (P#9)
>    Core L#5
>      PU L#10 (P#10)
>      PU L#11 (P#11)
>    Core L#6
>      PU L#12 (P#12)
>      PU L#13 (P#13)
>    Core L#7
>      PU L#14 (P#14)
>      PU L#15 (P#15)
> 
> tyr fd1026 117 ssh rs0 psrinfo -pv
> The physical processor has 8 virtual processors (0-7)
>  SPARC64-VII (portid 1024 impl 0x7 ver 0x91 clock 2400 MHz)
> The physical processor has 8 virtual processors (8-15)
>  SPARC64-VII (portid 1032 impl 0x7 ver 0x91 clock 2400 MHz)
> 
> 
> linpc0, linpc3:
> ---------------
> 
> AMD Athlon64 X2, 1 dual-core processor, no hardware threads
> 
> linpc0 fd1026 102 lstopo
> Machine (4023MB) + Socket L#0
>  L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
>  L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
> 
> 
> It is strange that openSuSE-Linux-12.1 thinks that two
> dual-core processors are available although the machines
> are only equipped with one processor.
> 
> linpc0 fd1026 104 cat /proc/cpuinfo  | grep -e processor -e "cpu core"
> processor       : 0
> cpu cores       : 2
> processor       : 1
> cpu cores       : 2
> 
> 
> linpc1:
> -------
> 
> Intel Xeon, 2 single core processors, no hardware threads
> 
> linpc1 fd1026 104  lstopo
> Machine (3829MB)
>  Socket L#0 + Core L#0 + PU L#0 (P#0)
>  Socket L#1 + Core L#1 + PU L#1 (P#1)
> 
> tyr fd1026 118 ssh linpc1 cat /proc/cpuinfo | grep -e processor -e "cpu core"
> processor       : 0
> cpu cores       : 1
> processor       : 1
> cpu cores       : 1
> 
> 
> linpc2:
> -------
> 
> AMD Opteron 280, 2 dual-core processors, no hardware threads
> 
> linpc2 fd1026 103 lstopo
> Machine (8190MB)
>  NUMANode L#0 (P#0 4094MB) + Socket L#0
>    L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
>    L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
>  NUMANode L#1 (P#1 4096MB) + Socket L#1
>    L2 L#2 (1024KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
>    L2 L#3 (1024KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
> 
> It is strange that openSuSE-Linux-12.1 thinks that four
> dual-core processors are available although the machine
> is only equipped with two processors.
> 
> linpc2 fd1026 104 cat /proc/cpuinfo | grep -e processor -e "cpu core"
> processor       : 0
> cpu cores       : 2
> processor       : 1
> cpu cores       : 2
> processor       : 2
> cpu cores       : 2
> processor       : 3
> cpu cores       : 2
> 
> 
> 
> linpc4:
> -------
> 
> AMD Opteron 1218, 1 dual-core processor, no hardware threads
> 
> linpc4 fd1026 100 lstopo
> Machine (4024MB) + Socket L#0
>  L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
>  L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
> 
> It is strange that openSuSE-Linux-12.1 thinks that two
> dual-core processors are available although the machine
> is only equipped with one processor.
> 
> tyr fd1026 230 ssh linpc4 cat /proc/cpuinfo | grep -e processor -e "cpu core"
> processor       : 0
> cpu cores       : 2
> processor       : 1
> cpu cores       : 2
> 
> 
> 
> sunpc0, sunpc3:
> ---------------
> 
> AMD Athlon64 X2, 1 dual-core processor, no hardware threads
> 
> sunpc0 fd1026 104 lstopo
> Machine (4094MB) + NUMANode L#0 (P#0 4094MB) + Socket L#0
>  Core L#0 + PU L#0 (P#0)
>  Core L#1 + PU L#1 (P#1)
> 
> tyr fd1026 111 ssh sunpc0 psrinfo -pv
> The physical processor has 2 virtual processors (0 1)
>  x86 (chipid 0x0 AuthenticAMD family 15 model 43 step 1 clock 2000 MHz)
>        AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
> 
> 
> sunpc1:
> -------
> 
> AMD Opteron 280, 2 dual-core processors, no hardware threads
> 
> sunpc1 fd1026 104 lstopo
> Machine (8191MB)
>  NUMANode L#0 (P#1 4095MB) + Socket L#0
>    Core L#0 + PU L#0 (P#0)
>    Core L#1 + PU L#1 (P#1)
>  NUMANode L#1 (P#2 4096MB) + Socket L#1
>    Core L#2 + PU L#2 (P#2)
>    Core L#3 + PU L#3 (P#3)
> 
> tyr fd1026 112 ssh sunpc1 psrinfo -pv
> The physical processor has 2 virtual processors (0 1)
>  x86 (chipid 0x0 AuthenticAMD family 15 model 33 step 2 clock 2411 MHz)
>        Dual Core AMD Opteron(tm) Processor 280
> The physical processor has 2 virtual processors (2 3)
>  x86 (chipid 0x1 AuthenticAMD family 15 model 33 step 2 clock 2411 MHz)
>        Dual Core AMD Opteron(tm) Processor 280
> 
> 
> sunpc2:
> -------
> 
> Intel Xeon, 2 single core processors, no hardware threads
> 
> sunpc2 fd1026 104 lstopo
> Machine (3904MB) + NUMANode L#0 (P#0 3904MB)
>  Socket L#0 + Core L#0 + PU L#0 (P#0)
>  Socket L#1 + Core L#1 + PU L#1 (P#1)
> 
> tyr fd1026 114 ssh sunpc2 psrinfo -pv
> The physical processor has 1 virtual processor (0)
>  x86 (chipid 0x0 GenuineIntel family 15 model 2 step 9 clock 2791 MHz)
>        Intel(r) Xeon(tm) CPU 2.80GHz
> The physical processor has 1 virtual processor (1)
>  x86 (chipid 0x3 GenuineIntel family 15 model 2 step 9 clock 2791 MHz)
>        Intel(r) Xeon(tm) CPU 2.80GHz
> 
> 
> sunpc4:
> -------
> 
> AMD Opteron 1218, 1 dual-core processor, no hardware threads
> 
> sunpc4 fd1026 104 lstopo
> Machine (4096MB) + NUMANode L#0 (P#0 4096MB) + Socket L#0
>  Core L#0 + PU L#0 (P#0)
>  Core L#1 + PU L#1 (P#1)
> 
> tyr fd1026 115 ssh sunpc4 psrinfo -pv
> The physical processor has 2 virtual processors (0 1)
>  x86 (chipid 0x0 AuthenticAMD family 15 model 67 step 2 clock 2613 MHz)
>        Dual-Core AMD Opteron(tm) Processor 1218
> 
> 
> 
> 
> Among others I got the following error messages (I can provide
> the complete file if you are interested in it).
> 
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host tyr -bind-to-core -bycore init_finalize
> [tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding 
> child 
> [[30908,1],2] to cpus 0004
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding 
> child 
> [[30908,1],0] to cpus 0001
> [tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding 
> child 
> [[30908,1],1] to cpus 0002
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an 
> error
> on node tyr.informatik.hs-fulda.de. More information may be available above.
> --------------------------------------------------------------------------
> 4 total processes failed to start
> 
> 
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host tyr -bind-to-core -bysocket init_finalize
> --------------------------------------------------------------------------
> An invalid physical processor ID was returned when attempting to bind
> an MPI process to a unique processor.
> 
> This usually means that you requested binding to more processors than
> exist (e.g., trying to bind N MPI processes to M processors, where N >
> M).  Double check that you have enough unique processors for all the
> MPI processes that you are launching on this host.
> 
> You job will now abort.
> --------------------------------------------------------------------------
> [tyr.informatik.hs-fulda.de:23215] [[30907,0],0] odls:default:fork binding 
> child 
> [[30907,1],0] to socket 0 cpus 0001
> [tyr.informatik.hs-fulda.de:23215] [[30907,0],0] odls:default:fork binding 
> child 
> [[30907,1],1] to socket 1 cpus 0002
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an 
> error
> on node tyr.informatik.hs-fulda.de. More information may be available above.
> --------------------------------------------------------------------------
> 4 total processes failed to start
> 
> 
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host rs0 -bind-to-core -bycore init_finalize
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [rs0.informatik.hs-fulda.de:05715] [[30936,0],1] odls:default:fork binding 
> child 
> [[30936,1],0] to cpus 0001
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Resource temporarily unavailable
> Node: rs0
> 
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
> 
> 
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host rs0 -bind-to-core -bysocket init_finalize
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [rs0.informatik.hs-fulda.de:05743] [[30916,0],1] odls:default:fork binding 
> child 
> [[30916,1],0] to socket 0 cpus 0001
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Resource temporarily unavailable
> Node: rs0
> 
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
> 
> 
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host rs0 -bind-to-socket -bycore init_finalize
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [rs0.informatik.hs-fulda.de:05771] [[30912,0],1] odls:default:fork binding 
> child 
> [[30912,1],0] to socket 0 cpus 0055
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Resource temporarily unavailable
> Node: rs0
> 
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
> 
> 
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host rs0 -bind-to-socket -bysocket 
> init_finalize
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [rs0.informatik.hs-fulda.de:05799] [[30924,0],1] odls:default:fork binding 
> child 
> [[30924,1],0] to socket 0 cpus 0055
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Resource temporarily unavailable
> Node: rs0
> 
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
> 
> 
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host linpc0 -bind-to-core -bycore 
> init_finalize
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],0] to 
> cpus 0001
> [linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],1] to 
> cpus 0002
> [linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],2] to 
> cpus 0004
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an 
> error
> on node linpc0. More information may be available above.
> --------------------------------------------------------------------------
> 4 total processes failed to start
> 
> 
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host linpc0 -bind-to-core -bysocket 
> init_finalize
> --------------------------------------------------------------------------
> An invalid physical processor ID was returned when attempting to bind
> an MPI process to a unique processor.
> 
> This usually means that you requested binding to more processors than
> exist (e.g., trying to bind N MPI processes to M processors, where N >
> M).  Double check that you have enough unique processors for all the
> MPI processes that you are launching on this host.
> 
> You job will now abort.
> --------------------------------------------------------------------------
> [linpc0:02326] [[30960,0],1] odls:default:fork binding child [[30960,1],0] to 
> socket 0 cpus 0001
> [linpc0:02326] [[30960,0],1] odls:default:fork binding child [[30960,1],1] to 
> socket 0 cpus 0002
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an 
> error
> on node linpc0. More information may be available above.
> --------------------------------------------------------------------------
> 4 total processes failed to start
> 
> 
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host linpc0 -bind-to-socket -bycore 
> init_finalize
> --------------------------------------------------------------------------
> Unable to bind to socket 0 on node linpc0.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Fatal
> Node: linpc0
> 
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
> 
> 
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host linpc0 -bind-to-socket -bysocket 
> init_finalize
> --------------------------------------------------------------------------
> Unable to bind to socket 0 on node linpc0.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Fatal
> Node: linpc0
> 
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
> 
> 
> 
> Hopefully this helps to track down the error. Thank you very much
> in advance for your help.
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
>>> I wrapped long lines so that they
>>> are easier to read. Have you had time to look at the
>>> segmentation fault with a rankfile which I reported in my
>>> last email (see below)?
>> 
>> I'm afraid not - been too busy lately. I'd suggest first focusing
>> on getting binding to work.
>> 
>>> 
>>> "tyr" is a two processor single core machine.
>>> 
>>> tyr fd1026 116 mpiexec -report-bindings -np 4 \
>>> -bind-to-socket -bycore rank_size
>>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>>> fork binding child [[27298,1],0] to socket 0 cpus 0001
>>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>>> fork binding child [[27298,1],1] to socket 1 cpus 0002
>>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>>> fork binding child [[27298,1],2] to socket 0 cpus 0001
>>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>>> fork binding child [[27298,1],3] to socket 1 cpus 0002
>>> I'm process 0 of 4 ...
>>> 
>>> 
>>> tyr fd1026 121 mpiexec -report-bindings -np 4 \
>>> -bind-to-socket -bysocket rank_size
>>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>>> fork binding child [[27380,1],0] to socket 0 cpus 0001
>>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>>> fork binding child [[27380,1],1] to socket 1 cpus 0002
>>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>>> fork binding child [[27380,1],2] to socket 0 cpus 0001
>>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>>> fork binding child [[27380,1],3] to socket 1 cpus 0002
>>> I'm process 0 of 4 ...
>>> 
>>> 
>>> tyr fd1026 117 mpiexec -report-bindings -np 4 \
>>> -bind-to-core -bycore rank_size
>>> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
>>> fork binding child [[27307,1],2] to cpus 0004
>>> ------------------------------------------------------------------
>>> An attempt to set processor affinity has failed - please check to
>>> ensure that your system supports such functionality. If so, then
>>> this is probably something that should be reported to the OMPI
>>> developers.
>>> ------------------------------------------------------------------
>>> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
>>> fork binding child [[27307,1],0] to cpus 0001
>>> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
>>> fork binding child [[27307,1],1] to cpus 0002
>>> ------------------------------------------------------------------
>>> mpiexec was unable to start the specified application
>>> as it encountered an error
>>> on node tyr.informatik.hs-fulda.de. More information may be
>>> available above.
>>> ------------------------------------------------------------------
>>> 4 total processes failed to start
>>> 
>>> 
>>> 
>>> tyr fd1026 118 mpiexec -report-bindings -np 4 \
>>> -bind-to-core -bysocket rank_size
>>> ------------------------------------------------------------------
>>> An invalid physical processor ID was returned when attempting to
>>> bind
>>> an MPI process to a unique processor.
>>> 
>>> This usually means that you requested binding to more processors
>>> than
>>> 
>>> exist (e.g., trying to bind N MPI processes to M processors,
>>> where N >
>>> M).  Double check that you have enough unique processors for
>>> all the
>>> MPI processes that you are launching on this host.
>>> 
>>> You job will now abort.
>>> ------------------------------------------------------------------
>>> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
>>> fork binding child [[27347,1],0] to socket 0 cpus 0001
>>> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
>>> fork binding child [[27347,1],1] to socket 1 cpus 0002
>>> ------------------------------------------------------------------
>>> mpiexec was unable to start the specified application as it
>>> encountered an error
>>> on node tyr.informatik.hs-fulda.de. More information may be
>>> available above.
>>> ------------------------------------------------------------------
>>> 4 total processes failed to start
>>> tyr fd1026 119 
>>> 
>>> 
>>> 
>>> "linpc3" and "linpc4" are two processor dual core machines.
>>> 
>>> linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
>>> -np 4 -bind-to-core -bycore rank_size
>>> [linpc4:16842] [[40914,0],0] odls:default:
>>> fork binding child [[40914,1],1] to cpus 0001
>>> [linpc4:16842] [[40914,0],0] odls:default:
>>> fork binding child [[40914,1],3] to cpus 0002
>>> [linpc3:31384] [[40914,0],1] odls:default:
>>> fork binding child [[40914,1],0] to cpus 0001
>>> [linpc3:31384] [[40914,0],1] odls:default:
>>> fork binding child [[40914,1],2] to cpus 0002
>>> I'm process 1 of 4 ...
>>> 
>>> 
>>> linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
>>> -np 4 -bind-to-core -bysocket rank_size
>>> [linpc4:16846] [[40918,0],0] odls:default:
>>> fork binding child [[40918,1],1] to socket 0 cpus 0001
>>> [linpc4:16846] [[40918,0],0] odls:default:
>>> fork binding child [[40918,1],3] to socket 0 cpus 0002
>>> [linpc3:31435] [[40918,0],1] odls:default:
>>> fork binding child [[40918,1],0] to socket 0 cpus 0001
>>> [linpc3:31435] [[40918,0],1] odls:default:
>>> fork binding child [[40918,1],2] to socket 0 cpus 0002
>>> I'm process 1 of 4 ...
>>> 
>>> 
>>> 
>>> 
>>> linpc4 fd1026 104 mpiexec -report-bindings -host linpc3,linpc4 \
>>> -np 4 -bind-to-socket -bycore rank_size
>>> ------------------------------------------------------------------
>>> Unable to bind to socket 0 on node linpc3.
>>> ------------------------------------------------------------------
>>> ------------------------------------------------------------------
>>> Unable to bind to socket 0 on node linpc4.
>>> ------------------------------------------------------------------
>>> ------------------------------------------------------------------
>>> mpiexec was unable to start the specified application as it
>>> encountered an error:
>>> 
>>> Error name: Fatal
>>> Node: linpc4
>>> 
>>> when attempting to start process rank 1.
>>> ------------------------------------------------------------------
>>> 4 total processes failed to start
>>> linpc4 fd1026 105 
>>> 
>>> 
>>> linpc4 fd1026 105 mpiexec -report-bindings -host linpc3,linpc4 \
>>> -np 4 -bind-to-socket -bysocket rank_size
>>> ------------------------------------------------------------------
>>> Unable to bind to socket 0 on node linpc4.
>>> ------------------------------------------------------------------
>>> ------------------------------------------------------------------
>>> Unable to bind to socket 0 on node linpc3.
>>> ------------------------------------------------------------------
>>> ------------------------------------------------------------------
>>> mpiexec was unable to start the specified application as it
>>> encountered an error:
>>> 
>>> Error name: Fatal
>>> Node: linpc4
>>> 
>>> when attempting to start process rank 1.
>>> --------------------------------------------------------------------------
>>> 4 total processes failed to start
>>> 
>>> 
>>> It's interesting that commands that work on Solaris fail on Linux
>>> and vice versa.
>>> 
>>> 
>>> Kind regards
>>> 
>>> Siegmar
>>> 
>>>>> I couldn't really say for certain - I don't see anything obviously
>>>>> wrong with your syntax, and the code appears to be working or else
>>>>> it would fail on the other nodes as well. The fact that it fails
>>>>> solely on that machine seems suspect.
>>>>> 
>>>>> Set aside the rankfile for the moment and try to just bind to cores
>>>>> on that machine, something like:
>>>>> 
>>>>> mpiexec --report-bindings -bind-to-core
>>>>> -host rs0.informatik.hs-fulda.de -n 2 rank_size
>>>>> 
>>>>> If that doesn't work, then the problem isn't with rankfile
>>>> 
>>>> It doesn't work but I found out something else as you can see below.
>>>> I get a segmentation fault for some rankfiles.
>>>> 
>>>> 
>>>> tyr small_prog 110 mpiexec --report-bindings -bind-to-core
>>>> -host rs0.informatik.hs-fulda.de -n 2 rank_size
>>>> --------------------------------------------------------------------------
>>>> An attempt to set processor affinity has failed - please check to
>>>> ensure that your system supports such functionality. If so, then
>>>> this is probably something that should be reported to the OMPI developers.
>>>> --------------------------------------------------------------------------
>>>> [rs0.informatik.hs-fulda.de:14695] [[30561,0],1] odls:default:
>>>> fork binding child [[30561,1],0] to cpus 0001
>>>> --------------------------------------------------------------------------
>>>> mpiexec was unable to start the specified application as it
>>>> encountered an error:
>>>> 
>>>> Error name: Resource temporarily unavailable
>>>> Node: rs0.informatik.hs-fulda.de
>>>> 
>>>> when attempting to start process rank 0.
>>>> --------------------------------------------------------------------------
>>>> 2 total processes failed to start
>>>> tyr small_prog 111 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Perhaps I have a hint for the error on Solaris Sparc. I use the
>>>> following rankfile to keep everything simple.
>>>> 
>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0
>>>> rank 1=linpc0.informatik.hs-fulda.de slot=0:0
>>>> rank 2=linpc1.informatik.hs-fulda.de slot=0:0
>>>> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0
>>>> rank 4=linpc3.informatik.hs-fulda.de slot=0:0
>>>> rank 5=linpc4.informatik.hs-fulda.de slot=0:0
>>>> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0
>>>> rank 7=sunpc1.informatik.hs-fulda.de slot=0:0
>>>> rank 8=sunpc2.informatik.hs-fulda.de slot=0:0
>>>> rank 9=sunpc3.informatik.hs-fulda.de slot=0:0
>>>> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0
>>>> 
>>>> When I execute "mpiexec -report-bindings -rf my_rankfile rank_size"
>>>> on a Linux-x86_64 or Solaris-10-x86_64 machine everything works fine.
>>>> 
>>>> linpc4 small_prog 104 mpiexec -report-bindings -rf my_rankfile rank_size
>>>> [linpc4:08018] [[49482,0],0] odls:default:fork binding child
>>>> [[49482,1],5] to slot_list 0:0
>>>> [linpc3:22030] [[49482,0],4] odls:default:fork binding child
>>>> [[49482,1],4] to slot_list 0:0
>>>> [linpc0:12887] [[49482,0],2] odls:default:fork binding child
>>>> [[49482,1],1] to slot_list 0:0
>>>> [linpc1:08323] [[49482,0],3] odls:default:fork binding child
>>>> [[49482,1],2] to slot_list 0:0
>>>> [sunpc1:17786] [[49482,0],6] odls:default:fork binding child
>>>> [[49482,1],7] to slot_list 0:0
>>>> [sunpc3.informatik.hs-fulda.de:08482] [[49482,0],8] odls:default:fork
>>>> binding child [[49482,1],9] to slot_list 0:0
>>>> [sunpc0.informatik.hs-fulda.de:11568] [[49482,0],5] odls:default:fork
>>>> binding child [[49482,1],6] to slot_list 0:0
>>>> [tyr.informatik.hs-fulda.de:21484] [[49482,0],1] odls:default:fork
>>>> binding child [[49482,1],0] to slot_list 0:0
>>>> [sunpc2.informatik.hs-fulda.de:28638] [[49482,0],7] odls:default:fork
>>>> binding child [[49482,1],8] to slot_list 0:0
>>>> ...
>>>> 
>>>> 
>>>> 
>>>> I get a segmentation fault when I run it on my local machine
>>>> (Solaris Sparc).
>>>> 
>>>> tyr small_prog 141 mpiexec -report-bindings -rf my_rankfile rank_size
>>>> [tyr.informatik.hs-fulda.de:21421] [[29113,0],0] ORTE_ERROR_LOG:
>>>> Data unpack would read past end of buffer in file
>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
>>>> at line 927
>>>> [tyr:21421] *** Process received signal ***
>>>> [tyr:21421] Signal: Segmentation Fault (11)
>>>> [tyr:21421] Signal code: Address not mapped (1)
>>>> [tyr:21421] Failing at address: 5ba
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x15d3ec
>>>> /lib/libc.so.1:0xcad04
>>>> /lib/libc.so.1:0xbf3b4
>>>> /lib/libc.so.1:0xbf59c
>>>> /lib/libc.so.1:0x58bd0 [ Signal 11 (SEGV)]
>>>> /lib/libc.so.1:free+0x24
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> orte_odls_base_default_construct_child_list+0x1234
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/
>>>> mca_odls_default.so:0x90b8
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x5e8d4
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> orte_daemon_cmd_processor+0x328
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x12e324
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> opal_event_base_loop+0x228
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> opal_progress+0xec
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> orte_plm_base_report_launched+0x1c4
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> orte_plm_base_launch_apps+0x318
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/mca_plm_rsh.so:
>>>> orte_plm_rsh_launch+0xac4
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:orterun+0x16a8
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:main+0x24
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:_start+0xd8
>>>> [tyr:21421] *** End of error message ***
>>>> Segmentation fault
>>>> tyr small_prog 142 
>>>> 
>>>> 
>>>> The funny thing is that I get a segmentation fault on the Linux
>>>> machine as well if I change my rankfile in the following way.
>>>> 
>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0
>>>> rank 1=linpc0.informatik.hs-fulda.de slot=0:0
>>>> #rank 2=linpc1.informatik.hs-fulda.de slot=0:0
>>>> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0
>>>> #rank 4=linpc3.informatik.hs-fulda.de slot=0:0
>>>> rank 5=linpc4.informatik.hs-fulda.de slot=0:0
>>>> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0
>>>> #rank 7=sunpc1.informatik.hs-fulda.de slot=0:0
>>>> #rank 8=sunpc2.informatik.hs-fulda.de slot=0:0
>>>> #rank 9=sunpc3.informatik.hs-fulda.de slot=0:0
>>>> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0
>>>> 
>>>> 
>>>> linpc4 small_prog 107 mpiexec -report-bindings -rf my_rankfile rank_size
>>>> [linpc4:08402] [[65226,0],0] ORTE_ERROR_LOG: Data unpack would
>>>> read past end of buffer in file 
>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
>>>> at line 927
>>>> [linpc4:08402] *** Process received signal ***
>>>> [linpc4:08402] Signal: Segmentation fault (11)
>>>> [linpc4:08402] Signal code: Address not mapped (1)
>>>> [linpc4:08402] Failing at address: 0x5f32fffc
>>>> [linpc4:08402] [ 0] [0xffffe410]
>>>> [linpc4:08402] [ 1] /usr/local/openmpi-1.6_32_cc/lib/openmpi/
>>>> mca_odls_default.so(+0x4023) [0xf73ec023]
>>>> [linpc4:08402] [ 2] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(+0x42b91) [0xf7667b91]
>>>> [linpc4:08402] [ 3] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(orte_daemon_cmd_processor+0x313) [0xf76655c3]
>>>> [linpc4:08402] [ 4] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(+0x8f366) [0xf76b4366]
>>>> [linpc4:08402] [ 5] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(opal_event_base_loop+0x18c) [0xf76b46bc]
>>>> [linpc4:08402] [ 6] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(opal_event_loop+0x26) [0xf76b4526]
>>>> [linpc4:08402] [ 7] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(opal_progress+0xba) [0xf769303a]
>>>> [linpc4:08402] [ 8] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(orte_plm_base_report_launched+0x13f) [0xf767d62f]
>>>> [linpc4:08402] [ 9] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(orte_plm_base_launch_apps+0x1b7) [0xf767bf27]
>>>> [linpc4:08402] [10] /usr/local/openmpi-1.6_32_cc/lib/openmpi/
>>>> mca_plm_rsh.so(orte_plm_rsh_launch+0xb2d) [0xf74228fd]
>>>> [linpc4:08402] [11] mpiexec(orterun+0x102f) [0x804e7bf]
>>>> [linpc4:08402] [12] mpiexec(main+0x13) [0x804c273]
>>>> [linpc4:08402] [13] /lib/libc.so.6(__libc_start_main+0xf3) [0xf745e003]
>>>> [linpc4:08402] *** End of error message ***
>>>> Segmentation fault
>>>> linpc4 small_prog 107 
>>>> 
>>>> 
>>>> Hopefully this information helps to fix the problem.
>>>> 
>>>> 
>>>> Kind regards
>>>> 
>>>> Siegmar
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Sep 5, 2012, at 5:50 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I'm new to rankfiles, so I played a little bit with different
>>>>>> options. I thought that the following entry would be similar to an
>>>>>> entry in an appfile and that MPI could place the process with rank 0
>>>>>> on any core of any processor.
>>>>>> 
>>>>>> rank 0=tyr.informatik.hs-fulda.de
>>>>>> 
>>>>>> Unfortunately it's not allowed and I got an error. Can somebody add
>>>>>> the missing help to the file?
>>>>>> 
>>>>>> 
>>>>>> tyr small_prog 126 mpiexec -rf my_rankfile -report-bindings rank_size
>>>>>> --------------------------------------------------------------------------
>>>>>> Sorry!  You were supposed to get help about:
>>>>>>  no-slot-list
>>>>>> from the file:
>>>>>>  help-rmaps_rank_file.txt
>>>>>> But I couldn't find that topic in the file.  Sorry!
>>>>>> --------------------------------------------------------------------------
>>>>>> 
>>>>>> 
>>>>>> As you can see below I could use a rankfile on my old local machine
>>>>>> (Sun Ultra 45) but not on our "new" one (Sun Server M4000). Today I
>>>>>> logged into the machine via ssh and tried the same command once more
>>>>>> as a local user without success. It's more or less the same error as
>>>>>> before when I tried to bind the process to a remote machine.
>>>>>> 
>>>>>> rs0 small_prog 118 mpiexec -rf my_rankfile -report-bindings rank_size
>>>>>> [rs0.informatik.hs-fulda.de:13745] [[19734,0],0] odls:default:fork
>>>>>> binding child [[19734,1],0] to slot_list 0:0
>>>>>> --------------------------------------------------------------------------
>>>>>> We were unable to successfully process/set the requested processor
>>>>>> affinity settings:
>>>>>> 
>>>>>> Specified slot list: 0:0
>>>>>> Error: Cross-device link
>>>>>> 
>>>>>> This could mean that a non-existent processor was specified, or
>>>>>> that the specification had improper syntax.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpiexec was unable to start the specified application as it encountered an error:
>>>>>> 
>>>>>> Error name: No such file or directory
>>>>>> Node: rs0.informatik.hs-fulda.de
>>>>>> 
>>>>>> when attempting to start process rank 0.
>>>>>> --------------------------------------------------------------------------
>>>>>> rs0 small_prog 119 
>>>>>> 
>>>>>> 
>>>>>> The application is available.
>>>>>> 
>>>>>> rs0 small_prog 119 which rank_size
>>>>>> /home/fd1026/SunOS/sparc/bin/rank_size
>>>>>> 
>>>>>> 
>>>>>> Is it a problem in the Open MPI implementation or in my rankfile?
>>>>>> How can I request which sockets and cores per socket are
>>>>>> available so that I can use correct values in my rankfile?
>>>>>> In lam-mpi I had a command "lamnodes" which I could use to get
>>>>>> such information. Thank you very much for any help in advance.
>>>>>> 
>>>>>> 
>>>>>> Kind regards
>>>>>> 
>>>>>> Siegmar
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> Are *all* the machines Sparc? Or just the 3rd one (rs0)?
>>>>>>> 
>>>>>>> Yes, both machines are Sparc. I tried first in a homogeneous
>>>>>>> environment.
>>>>>>> 
>>>>>>> tyr fd1026 106 psrinfo -v
>>>>>>> Status of virtual processor 0 as of: 09/04/2012 07:32:14
>>>>>>> on-line since 08/31/2012 15:44:42.
>>>>>>> The sparcv9 processor operates at 1600 MHz,
>>>>>>>      and has a sparcv9 floating point processor.
>>>>>>> Status of virtual processor 1 as of: 09/04/2012 07:32:14
>>>>>>> on-line since 08/31/2012 15:44:39.
>>>>>>> The sparcv9 processor operates at 1600 MHz,
>>>>>>>      and has a sparcv9 floating point processor.
>>>>>>> tyr fd1026 107 
>>>>>>> 
>>>>>>> My local machine (tyr) is a dual processor machine and the
>>>>>>> other one is equipped with two quad-core processors each
>>>>>>> capable of running two hardware threads.
>>>>>>> 
>>>>>>> 
>>>>>>> Kind regards
>>>>>>> 
>>>>>>> Siegmar
>>>>>>> 
>>>>>>> 
>>>>>>>> On Sep 3, 2012, at 12:43 PM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> the man page for "mpiexec" shows the following:
>>>>>>>>> 
>>>>>>>>>      cat myrankfile
>>>>>>>>>      rank 0=aa slot=1:0-2
>>>>>>>>>      rank 1=bb slot=0:0,1
>>>>>>>>>      rank 2=cc slot=1-2
>>>>>>>>>      mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
>>>>>>>>> 
>>>>>>>>> So that
>>>>>>>>>    Rank 0 runs on node aa, bound to socket 1, cores 0-2.
>>>>>>>>>    Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
>>>>>>>>>    Rank 2 runs on node cc, bound to cores 1 and 2.
>>>>>>>>> 
>>>>>>>>> Does it mean that the process with rank 0 should be bound to
>>>>>>>>> core 0, 1, or 2 of socket 1?
>>>>>>>>> 
>>>>>>>>> I tried to use a rankfile and have a problem. My rankfile contains
>>>>>>>>> the following lines.
>>>>>>>>> 
>>>>>>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0
>>>>>>>>> rank 1=tyr.informatik.hs-fulda.de slot=1:0
>>>>>>>>> #rank 2=rs0.informatik.hs-fulda.de slot=0:0
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Everything is fine if I use the file with just my local machine
>>>>>>>>> (the first two lines).
>>>>>>>>> 
>>>>>>>>> tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
>>>>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
>>>>>>>>> odls:default:fork binding child [[9849,1],0] to slot_list 0:0
>>>>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
>>>>>>>>> odls:default:fork binding child [[9849,1],1] to slot_list 1:0
>>>>>>>>> I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
>>>>>>>>> MPI standard 2.1 is supported.
>>>>>>>>> I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
>>>>>>>>> MPI standard 2.1 is supported.
>>>>>>>>> tyr small_prog 116 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I can also change the socket number and the processes will be attached
>>>>>>>>> to the correct cores. Unfortunately it doesn't work if I add one
>>>>>>>>> other machine (third line).
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> We were unable to successfully process/set the requested processor
>>>>>>>>> affinity settings:
>>>>>>>>> 
>>>>>>>>> Specified slot list: 0:0
>>>>>>>>> Error: Cross-device link
>>>>>>>>> 
>>>>>>>>> This could mean that a non-existent processor was specified, or
>>>>>>>>> that the specification had improper syntax.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>>>>>>>>> odls:default:fork binding child [[10212,1],0] to slot_list 0:0
>>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>>>>>>>>> odls:default:fork binding child [[10212,1],1] to slot_list 1:0
>>>>>>>>> [rs0.informatik.hs-fulda.de:12047] [[10212,0],1]
>>>>>>>>> odls:default:fork binding child [[10212,1],2] to slot_list 0:0
>>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process
>>>>>>>>> whose contact information is unknown in file
>>>>>>>>> ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
>>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send
>>>>>>>>> to [[10212,1],0]: tag 20
>>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG:
>>>>>>>>> A message is attempting to be sent to a process whose contact
>>>>>>>>> information is unknown in file
>>>>>>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
>>>>>>>>> at line 2501
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpiexec was unable to start the specified application as it
>>>>>>>>> encountered an error:
>>>>>>>>> 
>>>>>>>>> Error name: Error 0
>>>>>>>>> Node: rs0.informatik.hs-fulda.de
>>>>>>>>> 
>>>>>>>>> when attempting to start process rank 2.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> tyr small_prog 113 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The other machine has two 8 core processors.
>>>>>>>>> 
>>>>>>>>> tyr small_prog 121 ssh rs0 psrinfo -v
>>>>>>>>> Status of virtual processor 0 as of: 09/03/2012 19:51:15
>>>>>>>>> on-line since 07/26/2012 15:03:14.
>>>>>>>>> The sparcv9 processor operates at 2400 MHz,
>>>>>>>>>     and has a sparcv9 floating point processor.
>>>>>>>>> Status of virtual processor 1 as of: 09/03/2012 19:51:15
>>>>>>>>> ...
>>>>>>>>> Status of virtual processor 15 as of: 09/03/2012 19:51:15
>>>>>>>>> on-line since 07/26/2012 15:03:16.
>>>>>>>>> The sparcv9 processor operates at 2400 MHz,
>>>>>>>>>     and has a sparcv9 floating point processor.
>>>>>>>>> tyr small_prog 122 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Is it necessary to specify another option on the command line or
>>>>>>>>> is my rankfile faulty? Thank you very much for any suggestions in
>>>>>>>>> advance.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Kind regards
>>>>>>>>> 
>>>>>>>>> Siegmar
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> us...@open-mpi.org
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> 
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> 
>>>>> 
>>> 
>> 
>> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

