We actually include hwloc v1.3.2 in the OMPI v1.6 series. Can you download and try that on your machines?
http://www.open-mpi.org/software/hwloc/v1.3/

In particular, try the hwloc-bind executable (outside of OMPI) and see whether binding works properly on your machines. I typically run a test script when I'm testing binding:

------
[12:59] svbu-mpi059:~/mpi % lstopo --no-io
Machine (64GB)
  NUMANode L#0 (P#0 32GB) + Socket L#0 + L3 L#0 (20MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#16)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#17)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#18)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#19)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
      PU L#8 (P#4)
      PU L#9 (P#20)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
      PU L#10 (P#5)
      PU L#11 (P#21)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
      PU L#12 (P#6)
      PU L#13 (P#22)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#23)
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#24)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#25)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#26)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#27)
    L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12
      PU L#24 (P#12)
      PU L#25 (P#28)
    L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13
      PU L#26 (P#13)
      PU L#27 (P#29)
    L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14
      PU L#28 (P#14)
      PU L#29 (P#30)
    L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15
      PU L#30 (P#15)
      PU L#31 (P#31)
[12:59] svbu-mpi059:~/mpi % hwloc-bind socket:1.core:5 -l ./report-bindings.sh
MCW rank (svbu-mpi059): Socket:1.Core:5.PU:13 Socket:1.Core:5.PU:29
[13:00] svbu-mpi059:~/mpi % cat report-bindings.sh
#!/bin/sh
bitmap=`hwloc-bind --get -p`
friendly=`hwloc-calc -p -H socket.core.pu $bitmap`
echo "MCW rank $OMPI_COMM_WORLD_RANK (`hostname`): $friendly"
exit 0
[13:00] svbu-mpi059:~/mpi %
-----

Try just running hwloc-bind to bind yourself to some logical location, run my report-bindings.sh script, and see whether the physical indexes that it outputs are correct. (A small example adapted to your tyr machine is inlined further below, right after your description of the test setup.)

On Sep 10, 2012, at 7:34 AM, Siegmar Gross wrote:

> Hi,
> 
>>> are the following outputs helpful to find the error with
>>> a rankfile on Solaris?
>> 
>> If you can't bind on the new Solaris machine, then the rankfile
>> won't do you any good. It looks like we are getting the incorrect
>> number of cores on that machine - is it possible that it has
>> hardware threads, and doesn't report "cores"? Can you download
>> and run a copy of lstopo to check the output? You get that from
>> the hwloc folks:
>> 
>> http://www.open-mpi.org/software/hwloc/v1.5/
> 
> I downloaded and installed the package on our machines. Perhaps it is
> easier to detect the error if you have more information. Therefore I
> provide the different hardware architectures of all machines on which
> a simple program breaks if I try to bind processes to sockets or cores.
> 
> I tried the following five commands with "h" one of "tyr", "rs0",
> "linpc0", "linpc1", "linpc2", "linpc4", "sunpc0", "sunpc1",
> "sunpc2", or "sunpc4" in a shell script file which I started on
> my local machine ("tyr"). "works on" means that the small program
> (MPI_Init, printf, MPI_Finalize) didn't break. I didn't check if
> the layout of the processes was correct.
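Here is the example mentioned above. It is only a rough sketch: it assumes hwloc-bind and hwloc-calc are in your PATH on tyr, and it uses the logical indexes from the tyr lstopo output you show below (two sockets, one core each):

-----
#!/bin/sh
# Rough sketch: bind to core 0 of each socket on a two-socket machine
# like tyr and report the resulting physical indexes with the
# report-bindings.sh script shown above.
for s in 0 1; do
  echo "binding to socket:$s.core:0 (logical):"
  hwloc-bind socket:$s.core:0 -l ./report-bindings.sh
done
-----

If plain hwloc-bind behaves correctly there, report-bindings.sh can also be launched under mpiexec (it prints $OMPI_COMM_WORLD_RANK), which makes it easy to compare what OMPI binds to against what hwloc reports.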
> > > mpiexec -report-bindings -np 4 -host h init_finalize > > works on: tyr, rs0, linpc0, linpc1, linpc2, linpc4, sunpc0, sunpc1, > sunpc2, sunpc4 > breaks on: - > > > mpiexec -report-bindings -np 4 -host h -bind-to-core -bycore init_finalize > > works on: linpc2, sunpc1 > breaks on: tyr, rs0, linpc0, linpc1, linpc4, sunpc0, sunpc2, sunpc4 > > > mpiexec -report-bindings -np 4 -host h -bind-to-core -bysocket init_finalize > > works on: linpc2, sunpc1 > breaks on: tyr, rs0, linpc0, linpc1, linpc4, sunpc0, sunpc2, sunpc4 > > > mpiexec -report-bindings -np 4 -host h -bind-to-socket -bycore init_finalize > > works on: tyr, linpc1, linpc2, sunpc1, sunpc2 > breaks on: rs0, linpc0, linpc4, sunpc0, sunpc4 > > > mpiexec -report-bindings -np 4 -host h -bind-to-socket -bysocket init_finalize > > works on: tyr, linpc1, linpc2, sunpc1, sunpc2 > breaks on: rs0, linpc0, linpc4, sunpc0, sunpc4 > > > > "lstopo" shows the following hardware configurations for the above > machines. The first line always shows the installed architecture. > "lstopo" does a good job as far as I can see it. > > tyr: > ---- > > UltraSPARC-IIIi, 2 single core processors, no hardware threads > > tyr fd1026 183 lstopo > Machine (4096MB) > NUMANode L#0 (P#2 2048MB) + Socket L#0 + Core L#0 + PU L#0 (P#0) > NUMANode L#1 (P#1 2048MB) + Socket L#1 + Core L#1 + PU L#1 (P#1) > > tyr fd1026 116 psrinfo -pv > The physical processor has 1 virtual processor (0) > UltraSPARC-IIIi (portid 0 impl 0x16 ver 0x34 clock 1600 MHz) > The physical processor has 1 virtual processor (1) > UltraSPARC-IIIi (portid 1 impl 0x16 ver 0x34 clock 1600 MHz) > > > rs0, rs1: > --------- > > SPARC64-VII, 2 quad-core processors, 2 hardware threads / core > > rs0 fd1026 105 lstopo > Machine (32GB) + NUMANode L#0 (P#1 32GB) > Socket L#0 > Core L#0 > PU L#0 (P#0) > PU L#1 (P#1) > Core L#1 > PU L#2 (P#2) > PU L#3 (P#3) > Core L#2 > PU L#4 (P#4) > PU L#5 (P#5) > Core L#3 > PU L#6 (P#6) > PU L#7 (P#7) > Socket L#1 > Core L#4 > PU L#8 (P#8) > PU L#9 (P#9) > Core L#5 > PU L#10 (P#10) > PU L#11 (P#11) > Core L#6 > PU L#12 (P#12) > PU L#13 (P#13) > Core L#7 > PU L#14 (P#14) > PU L#15 (P#15) > > tyr fd1026 117 ssh rs0 psrinfo -pv > The physical processor has 8 virtual processors (0-7) > SPARC64-VII (portid 1024 impl 0x7 ver 0x91 clock 2400 MHz) > The physical processor has 8 virtual processors (8-15) > SPARC64-VII (portid 1032 impl 0x7 ver 0x91 clock 2400 MHz) > > > linpc0, linpc3: > --------------- > > AMD Athlon64 X2, 1 dual-core processor, no hardware threads > > linpc0 fd1026 102 lstopo > Machine (4023MB) + Socket L#0 > L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0) > L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1) > > > It is strange that openSuSE-Linux-12.1 thinks that two > dual-core processors are available although the machines > are only equipped with one processor. 
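Side note on the "two dual-core processors" remark: if I am reading the /proc/cpuinfo output quoted just below correctly, it actually describes a single dual-core package. Each "processor" stanza is one logical CPU, and "cpu cores" is the number of cores per physical package rather than a count of packages, which matches the single Socket that lstopo reports for linpc0 above. One way to cross-check (a small sketch, Linux only, and assuming the kernel exposes the "physical id" field):

-----
#!/bin/sh
# Rough sketch (Linux only): count logical CPUs and physical packages
# as the kernel reports them in /proc/cpuinfo.
logical=`grep -c '^processor' /proc/cpuinfo`
packages=`grep 'physical id' /proc/cpuinfo | sort -u | wc -l`
echo "logical CPUs: $logical, physical packages: $packages"
-----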
> > linpc0 fd1026 104 cat /proc/cpuinfo | grep -e processor -e "cpu core" > processor : 0 > cpu cores : 2 > processor : 1 > cpu cores : 2 > > > linpc1: > ------- > > Intel Xeon, 2 single core processors, no hardware threads > > linpc1 fd1026 104 lstopo > Machine (3829MB) > Socket L#0 + Core L#0 + PU L#0 (P#0) > Socket L#1 + Core L#1 + PU L#1 (P#1) > > tyr fd1026 118 ssh linpc1 cat /proc/cpuinfo | grep -e processor -e "cpu core" > processor : 0 > cpu cores : 1 > processor : 1 > cpu cores : 1 > > > linpc2: > ------- > > AMD Opteron 280, 2 dual-core processors, no hardware threads > > linpc2 fd1026 103 lstopo > Machine (8190MB) > NUMANode L#0 (P#0 4094MB) + Socket L#0 > L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0) > L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1) > NUMANode L#1 (P#1 4096MB) + Socket L#1 > L2 L#2 (1024KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2) > L2 L#3 (1024KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3) > > It is strange that openSuSE-Linux-12.1 thinks that four > dual-core processors are available although the machine > is only equipped with two processors. > > linpc2 fd1026 104 cat /proc/cpuinfo | grep -e processor -e "cpu core" > processor : 0 > cpu cores : 2 > processor : 1 > cpu cores : 2 > processor : 2 > cpu cores : 2 > processor : 3 > cpu cores : 2 > > > > linpc4: > ------- > > AMD Opteron 1218, 1 dual-core processors, no hardware threads > > linpc4 fd1026 100 lstopo > Machine (4024MB) + Socket L#0 > L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0) > L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1) > > It is strange that openSuSE-Linux-12.1 thinks that two > dual-core processors are available although the machine > is only equipped with one processor. 
> > tyr fd1026 230 ssh linpc4 cat /proc/cpuinfo | grep -e processor -e "cpu core" > processor : 0 > cpu cores : 2 > processor : 1 > cpu cores : 2 > > > > sunpc0, sunpc3: > --------------- > > AMD Athlon64 X2, 1 dual-core processor, no hardware threads > > sunpc0 fd1026 104 lstopo > Machine (4094MB) + NUMANode L#0 (P#0 4094MB) + Socket L#0 > Core L#0 + PU L#0 (P#0) > Core L#1 + PU L#1 (P#1) > > tyr fd1026 111 ssh sunpc0 psrinfo -pv > The physical processor has 2 virtual processors (0 1) > x86 (chipid 0x0 AuthenticAMD family 15 model 43 step 1 clock 2000 MHz) > AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ > > > sunpc1: > ------- > > AMD Opteron 280, 2 dual-core processors, no hardware threads > > sunpc1 fd1026 104 lstopo > Machine (8191MB) > NUMANode L#0 (P#1 4095MB) + Socket L#0 > Core L#0 + PU L#0 (P#0) > Core L#1 + PU L#1 (P#1) > NUMANode L#1 (P#2 4096MB) + Socket L#1 > Core L#2 + PU L#2 (P#2) > Core L#3 + PU L#3 (P#3) > > tyr fd1026 112 ssh sunpc1 psrinfo -pv > The physical processor has 2 virtual processors (0 1) > x86 (chipid 0x0 AuthenticAMD family 15 model 33 step 2 clock 2411 MHz) > Dual Core AMD Opteron(tm) Processor 280 > The physical processor has 2 virtual processors (2 3) > x86 (chipid 0x1 AuthenticAMD family 15 model 33 step 2 clock 2411 MHz) > Dual Core AMD Opteron(tm) Processor 280 > > > sunpc2: > ------- > > Intel Xeon, 2 single core processors, no hardware threads > > sunpc2 fd1026 104 lstopo > Machine (3904MB) + NUMANode L#0 (P#0 3904MB) > Socket L#0 + Core L#0 + PU L#0 (P#0) > Socket L#1 + Core L#1 + PU L#1 (P#1) > > tyr fd1026 114 ssh sunpc2 psrinfo -pv > The physical processor has 1 virtual processor (0) > x86 (chipid 0x0 GenuineIntel family 15 model 2 step 9 clock 2791 MHz) > Intel(r) Xeon(tm) CPU 2.80GHz > The physical processor has 1 virtual processor (1) > x86 (chipid 0x3 GenuineIntel family 15 model 2 step 9 clock 2791 MHz) > Intel(r) Xeon(tm) CPU 2.80GHz > > > sunpc4: > ------- > > AMD Opteron 1218, 1 dual-core processor, no hardware threads > > sunpc4 fd1026 104 lstopo > Machine (4096MB) + NUMANode L#0 (P#0 4096MB) + Socket L#0 > Core L#0 + PU L#0 (P#0) > Core L#1 + PU L#1 (P#1) > > tyr fd1026 115 ssh sunpc4 psrinfo -pv > The physical processor has 2 virtual processors (0 1) > x86 (chipid 0x0 AuthenticAMD family 15 model 67 step 2 clock 2613 MHz) > Dual-Core AMD Opteron(tm) Processor 1218 > > > > > Among others I got the following error messages (I can provide > the complete file if you are interested in it). > > ################## > ################## > mpiexec -report-bindings -np 4 -host tyr -bind-to-core -bycore init_finalize > [tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding > child > [[30908,1],2] to cpus 0004 > -------------------------------------------------------------------------- > An attempt to set processor affinity has failed - please check to > ensure that your system supports such functionality. If so, then > this is probably something that should be reported to the OMPI developers. > -------------------------------------------------------------------------- > [tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding > child > [[30908,1],0] to cpus 0001 > [tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding > child > [[30908,1],1] to cpus 0002 > -------------------------------------------------------------------------- > mpiexec was unable to start the specified application as it encountered an > error > on node tyr.informatik.hs-fulda.de. More information may be available above. 
> -------------------------------------------------------------------------- > 4 total processes failed to start > > > ################## > ################## > mpiexec -report-bindings -np 4 -host tyr -bind-to-core -bysocket init_finalize > -------------------------------------------------------------------------- > An invalid physical processor ID was returned when attempting to bind > an MPI process to a unique processor. > > This usually means that you requested binding to more processors than > exist (e.g., trying to bind N MPI processes to M processors, where N > > M). Double check that you have enough unique processors for all the > MPI processes that you are launching on this host. > > You job will now abort. > -------------------------------------------------------------------------- > [tyr.informatik.hs-fulda.de:23215] [[30907,0],0] odls:default:fork binding > child > [[30907,1],0] to socket 0 cpus 0001 > [tyr.informatik.hs-fulda.de:23215] [[30907,0],0] odls:default:fork binding > child > [[30907,1],1] to socket 1 cpus 0002 > -------------------------------------------------------------------------- > mpiexec was unable to start the specified application as it encountered an > error > on node tyr.informatik.hs-fulda.de. More information may be available above. > -------------------------------------------------------------------------- > 4 total processes failed to start > > > ################## > ################## > mpiexec -report-bindings -np 4 -host rs0 -bind-to-core -bycore init_finalize > -------------------------------------------------------------------------- > An attempt to set processor affinity has failed - please check to > ensure that your system supports such functionality. If so, then > this is probably something that should be reported to the OMPI developers. > -------------------------------------------------------------------------- > [rs0.informatik.hs-fulda.de:05715] [[30936,0],1] odls:default:fork binding > child > [[30936,1],0] to cpus 0001 > -------------------------------------------------------------------------- > mpiexec was unable to start the specified application as it encountered an > error: > > Error name: Resource temporarily unavailable > Node: rs0 > > when attempting to start process rank 0. > -------------------------------------------------------------------------- > 4 total processes failed to start > > > ################## > ################## > mpiexec -report-bindings -np 4 -host rs0 -bind-to-core -bysocket init_finalize > -------------------------------------------------------------------------- > An attempt to set processor affinity has failed - please check to > ensure that your system supports such functionality. If so, then > this is probably something that should be reported to the OMPI developers. > -------------------------------------------------------------------------- > [rs0.informatik.hs-fulda.de:05743] [[30916,0],1] odls:default:fork binding > child > [[30916,1],0] to socket 0 cpus 0001 > -------------------------------------------------------------------------- > mpiexec was unable to start the specified application as it encountered an > error: > > Error name: Resource temporarily unavailable > Node: rs0 > > when attempting to start process rank 0. 
> -------------------------------------------------------------------------- > 4 total processes failed to start > > > ################## > ################## > mpiexec -report-bindings -np 4 -host rs0 -bind-to-socket -bycore init_finalize > -------------------------------------------------------------------------- > An attempt to set processor affinity has failed - please check to > ensure that your system supports such functionality. If so, then > this is probably something that should be reported to the OMPI developers. > -------------------------------------------------------------------------- > [rs0.informatik.hs-fulda.de:05771] [[30912,0],1] odls:default:fork binding > child > [[30912,1],0] to socket 0 cpus 0055 > -------------------------------------------------------------------------- > mpiexec was unable to start the specified application as it encountered an > error: > > Error name: Resource temporarily unavailable > Node: rs0 > > when attempting to start process rank 0. > -------------------------------------------------------------------------- > 4 total processes failed to start > > > ################## > ################## > mpiexec -report-bindings -np 4 -host rs0 -bind-to-socket -bysocket > init_finalize > -------------------------------------------------------------------------- > An attempt to set processor affinity has failed - please check to > ensure that your system supports such functionality. If so, then > this is probably something that should be reported to the OMPI developers. > -------------------------------------------------------------------------- > [rs0.informatik.hs-fulda.de:05799] [[30924,0],1] odls:default:fork binding > child > [[30924,1],0] to socket 0 cpus 0055 > -------------------------------------------------------------------------- > mpiexec was unable to start the specified application as it encountered an > error: > > Error name: Resource temporarily unavailable > Node: rs0 > > when attempting to start process rank 0. > -------------------------------------------------------------------------- > 4 total processes failed to start > > > ################## > ################## > mpiexec -report-bindings -np 4 -host linpc0 -bind-to-core -bycore > init_finalize > -------------------------------------------------------------------------- > An attempt to set processor affinity has failed - please check to > ensure that your system supports such functionality. If so, then > this is probably something that should be reported to the OMPI developers. > -------------------------------------------------------------------------- > [linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],0] to > cpus 0001 > [linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],1] to > cpus 0002 > [linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],2] to > cpus 0004 > -------------------------------------------------------------------------- > mpiexec was unable to start the specified application as it encountered an > error > on node linpc0. More information may be available above. > -------------------------------------------------------------------------- > 4 total processes failed to start > > > ################## > ################## > mpiexec -report-bindings -np 4 -host linpc0 -bind-to-core -bysocket > init_finalize > -------------------------------------------------------------------------- > An invalid physical processor ID was returned when attempting to bind > an MPI process to a unique processor. 
> > This usually means that you requested binding to more processors than > exist (e.g., trying to bind N MPI processes to M processors, where N > > M). Double check that you have enough unique processors for all the > MPI processes that you are launching on this host. > > You job will now abort. > -------------------------------------------------------------------------- > [linpc0:02326] [[30960,0],1] odls:default:fork binding child [[30960,1],0] to > socket 0 cpus 0001 > [linpc0:02326] [[30960,0],1] odls:default:fork binding child [[30960,1],1] to > socket 0 cpus 0002 > -------------------------------------------------------------------------- > mpiexec was unable to start the specified application as it encountered an > error > on node linpc0. More information may be available above. > -------------------------------------------------------------------------- > 4 total processes failed to start > > > ################## > ################## > mpiexec -report-bindings -np 4 -host linpc0 -bind-to-socket -bycore > init_finalize > -------------------------------------------------------------------------- > Unable to bind to socket 0 on node linpc0. > -------------------------------------------------------------------------- > -------------------------------------------------------------------------- > mpiexec was unable to start the specified application as it encountered an > error: > > Error name: Fatal > Node: linpc0 > > when attempting to start process rank 0. > -------------------------------------------------------------------------- > 4 total processes failed to start > > > ################## > ################## > mpiexec -report-bindings -np 4 -host linpc0 -bind-to-socket -bysocket > init_finalize > -------------------------------------------------------------------------- > Unable to bind to socket 0 on node linpc0. > -------------------------------------------------------------------------- > -------------------------------------------------------------------------- > mpiexec was unable to start the specified application as it encountered an > error: > > Error name: Fatal > Node: linpc0 > > when attempting to start process rank 0. > -------------------------------------------------------------------------- > 4 total processes failed to start > > > > Hopefully this helps to track the error. Thank you very much > for your help in advance. > > > Kind regards > > Siegmar > > > >>> I wrapped long lines so that they >>> are easier to read. Have you had time to look at the >>> segmentation fault with a rankfile which I reported in my >>> last email (see below)? >> >> I'm afraid not - been too busy lately. I'd suggest first focusing >> on getting binding to work. >> >>> >>> "tyr" is a two processor single core machine. >>> >>> tyr fd1026 116 mpiexec -report-bindings -np 4 \ >>> -bind-to-socket -bycore rank_size >>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >>> fork binding child [[27298,1],0] to socket 0 cpus 0001 >>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >>> fork binding child [[27298,1],1] to socket 1 cpus 0002 >>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >>> fork binding child [[27298,1],2] to socket 0 cpus 0001 >>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default: >>> fork binding child [[27298,1],3] to socket 1 cpus 0002 >>> I'm process 0 of 4 ... 
>>> >>> >>> tyr fd1026 121 mpiexec -report-bindings -np 4 \ >>> -bind-to-socket -bysocket rank_size >>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >>> fork binding child [[27380,1],0] to socket 0 cpus 0001 >>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >>> fork binding child [[27380,1],1] to socket 1 cpus 0002 >>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >>> fork binding child [[27380,1],2] to socket 0 cpus 0001 >>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default: >>> fork binding child [[27380,1],3] to socket 1 cpus 0002 >>> I'm process 0 of 4 ... >>> >>> >>> tyr fd1026 117 mpiexec -report-bindings -np 4 \ >>> -bind-to-core -bycore rank_size >>> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default: >>> fork binding child [[27307,1],2] to cpus 0004 >>> ------------------------------------------------------------------ >>> An attempt to set processor affinity has failed - please check to >>> ensure that your system supports such functionality. If so, then >>> this is probably something that should be reported to the OMPI >>> developers. >>> ------------------------------------------------------------------ >>> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default: >>> fork binding child [[27307,1],0] to cpus 0001 >>> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default: >>> fork binding child [[27307,1],1] to cpus 0002 >>> ------------------------------------------------------------------ >>> mpiexec was unable to start the specified application >>> as it encountered an error >>> on node tyr.informatik.hs-fulda.de. More information may be >>> available above. >>> ------------------------------------------------------------------ >>> 4 total processes failed to start >>> >>> >>> >>> tyr fd1026 118 mpiexec -report-bindings -np 4 \ >>> -bind-to-core -bysocket rank_size >>> ------------------------------------------------------------------ >>> An invalid physical processor ID was returned when attempting to >>> bind >>> an MPI process to a unique processor. >>> >>> This usually means that you requested binding to more processors >>> than >>> >>> exist (e.g., trying to bind N MPI processes to M processors, >>> where N > >>> M). Double check that you have enough unique processors for >>> all the >>> MPI processes that you are launching on this host. >>> >>> You job will now abort. >>> ------------------------------------------------------------------ >>> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default: >>> fork binding child [[27347,1],0] to socket 0 cpus 0001 >>> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default: >>> fork binding child [[27347,1],1] to socket 1 cpus 0002 >>> ------------------------------------------------------------------ >>> mpiexec was unable to start the specified application as it >>> encountered an error >>> on node tyr.informatik.hs-fulda.de. More information may be >>> available above. >>> ------------------------------------------------------------------ >>> 4 total processes failed to start >>> tyr fd1026 119 >>> >>> >>> >>> "linpc3" and "linpc4" are two processor dual core machines. 
>>> >>> linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \ >>> -np 4 -bind-to-core -bycore rank_size >>> [linpc4:16842] [[40914,0],0] odls:default: >>> fork binding child [[40914,1],1] to cpus 0001 >>> [linpc4:16842] [[40914,0],0] odls:default: >>> fork binding child [[40914,1],3] to cpus 0002 >>> [linpc3:31384] [[40914,0],1] odls:default: >>> fork binding child [[40914,1],0] to cpus 0001 >>> [linpc3:31384] [[40914,0],1] odls:default: >>> fork binding child [[40914,1],2] to cpus 0002 >>> I'm process 1 of 4 ... >>> >>> >>> linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \ >>> -np 4 -bind-to-core -bysocket rank_size >>> [linpc4:16846] [[40918,0],0] odls:default: >>> fork binding child [[40918,1],1] to socket 0 cpus 0001 >>> [linpc4:16846] [[40918,0],0] odls:default: >>> fork binding child [[40918,1],3] to socket 0 cpus 0002 >>> [linpc3:31435] [[40918,0],1] odls:default: >>> fork binding child [[40918,1],0] to socket 0 cpus 0001 >>> [linpc3:31435] [[40918,0],1] odls:default: >>> fork binding child [[40918,1],2] to socket 0 cpus 0002 >>> I'm process 1 of 4 ... >>> >>> >>> >>> >>> linpc4 fd1026 104 mpiexec -report-bindings -host linpc3,linpc4 \ >>> -np 4 -bind-to-socket -bycore rank_size >>> ------------------------------------------------------------------ >>> Unable to bind to socket 0 on node linpc3. >>> ------------------------------------------------------------------ >>> ------------------------------------------------------------------ >>> Unable to bind to socket 0 on node linpc4. >>> ------------------------------------------------------------------ >>> ------------------------------------------------------------------ >>> mpiexec was unable to start the specified application as it >>> encountered an error: >>> >>> Error name: Fatal >>> Node: linpc4 >>> >>> when attempting to start process rank 1. >>> ------------------------------------------------------------------ >>> 4 total processes failed to start >>> linpc4 fd1026 105 >>> >>> >>> linpc4 fd1026 105 mpiexec -report-bindings -host linpc3,linpc4 \ >>> -np 4 -bind-to-socket -bysocket rank_size >>> ------------------------------------------------------------------ >>> Unable to bind to socket 0 on node linpc4. >>> ------------------------------------------------------------------ >>> ------------------------------------------------------------------ >>> Unable to bind to socket 0 on node linpc3. >>> ------------------------------------------------------------------ >>> ------------------------------------------------------------------ >>> mpiexec was unable to start the specified application as it >>> encountered an error: >>> >>> Error name: Fatal >>> Node: linpc4 >>> >>> when attempting to start process rank 1. >>> -------------------------------------------------------------------------- >>> 4 total processes failed to start >>> >>> >>> It's interesting that commands that work on Solaris fail on Linux >>> and vice versa. >>> >>> >>> Kind regards >>> >>> Siegmar >>> >>>>> I couldn't really say for certain - I don't see anything obviously >>>>> wrong with your syntax, and the code appears to be working or else >>>>> it would fail on the other nodes as well. The fact that it fails >>>>> solely on that machine seems suspect. 
>>>>> >>>>> Set aside the rankfile for the moment and try to just bind to cores >>>>> on that machine, something like: >>>>> >>>>> mpiexec --report-bindings -bind-to-core >>>>> -host rs0.informatik.hs-fulda.de -n 2 rank_size >>>>> >>>>> If that doesn't work, then the problem isn't with rankfile >>>> >>>> It doesn't work but I found out something else as you can see below. >>>> I get a segmentation fault for some rankfiles. >>>> >>>> >>>> tyr small_prog 110 mpiexec --report-bindings -bind-to-core >>>> -host rs0.informatik.hs-fulda.de -n 2 rank_size >>>> -------------------------------------------------------------------------- >>>> An attempt to set processor affinity has failed - please check to >>>> ensure that your system supports such functionality. If so, then >>>> this is probably something that should be reported to the OMPI developers. >>>> -------------------------------------------------------------------------- >>>> [rs0.informatik.hs-fulda.de:14695] [[30561,0],1] odls:default: >>>> fork binding child [[30561,1],0] to cpus 0001 >>>> -------------------------------------------------------------------------- >>>> mpiexec was unable to start the specified application as it >>>> encountered an error: >>>> >>>> Error name: Resource temporarily unavailable >>>> Node: rs0.informatik.hs-fulda.de >>>> >>>> when attempting to start process rank 0. >>>> -------------------------------------------------------------------------- >>>> 2 total processes failed to start >>>> tyr small_prog 111 >>>> >>>> >>>> >>>> >>>> Perhaps I have a hint for the error on Solaris Sparc. I use the >>>> following rankfile to keep everything simple. >>>> >>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0 >>>> rank 1=linpc0.informatik.hs-fulda.de slot=0:0 >>>> rank 2=linpc1.informatik.hs-fulda.de slot=0:0 >>>> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0 >>>> rank 4=linpc3.informatik.hs-fulda.de slot=0:0 >>>> rank 5=linpc4.informatik.hs-fulda.de slot=0:0 >>>> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0 >>>> rank 7=sunpc1.informatik.hs-fulda.de slot=0:0 >>>> rank 8=sunpc2.informatik.hs-fulda.de slot=0:0 >>>> rank 9=sunpc3.informatik.hs-fulda.de slot=0:0 >>>> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0 >>>> >>>> When I execute "mpiexec -report-bindings -rf my_rankfile rank_size" >>>> on a Linux-x86_64 or Solaris-10-x86_64 machine everything works fine. >>>> >>>> linpc4 small_prog 104 mpiexec -report-bindings -rf my_rankfile rank_size >>>> [linpc4:08018] [[49482,0],0] odls:default:fork binding child >>>> [[49482,1],5] to slot_list 0:0 >>>> [linpc3:22030] [[49482,0],4] odls:default:fork binding child >>>> [[49482,1],4] to slot_list 0:0 >>>> [linpc0:12887] [[49482,0],2] odls:default:fork binding child >>>> [[49482,1],1] to slot_list 0:0 >>>> [linpc1:08323] [[49482,0],3] odls:default:fork binding child >>>> [[49482,1],2] to slot_list 0:0 >>>> [sunpc1:17786] [[49482,0],6] odls:default:fork binding child >>>> [[49482,1],7] to slot_list 0:0 >>>> [sunpc3.informatik.hs-fulda.de:08482] [[49482,0],8] odls:default:fork >>>> binding child [[49482,1],9] to slot_list 0:0 >>>> [sunpc0.informatik.hs-fulda.de:11568] [[49482,0],5] odls:default:fork >>>> binding child [[49482,1],6] to slot_list 0:0 >>>> [tyr.informatik.hs-fulda.de:21484] [[49482,0],1] odls:default:fork >>>> binding child [[49482,1],0] to slot_list 0:0 >>>> [sunpc2.informatik.hs-fulda.de:28638] [[49482,0],7] odls:default:fork >>>> binding child [[49482,1],8] to slot_list 0:0 >>>> ... 
>>>> >>>> >>>> >>>> I get a segmentation fault when I run it on my local machine >>>> (Solaris Sparc). >>>> >>>> tyr small_prog 141 mpiexec -report-bindings -rf my_rankfile rank_size >>>> [tyr.informatik.hs-fulda.de:21421] [[29113,0],0] ORTE_ERROR_LOG: >>>> Data unpack would read past end of buffer in file >>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c >>>> at line 927 >>>> [tyr:21421] *** Process received signal *** >>>> [tyr:21421] Signal: Segmentation Fault (11) >>>> [tyr:21421] Signal code: Address not mapped (1) >>>> [tyr:21421] Failing at address: 5ba >>>> > /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x15d3ec >>>> /lib/libc.so.1:0xcad04 >>>> /lib/libc.so.1:0xbf3b4 >>>> /lib/libc.so.1:0xbf59c >>>> /lib/libc.so.1:0x58bd0 [ Signal 11 (SEGV)] >>>> /lib/libc.so.1:free+0x24 >>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>>> orte_odls_base_default_construct_child_list+0x1234 >>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/ >>>> mca_odls_default.so:0x90b8 >>>> > /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x5e8d4 >>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>>> orte_daemon_cmd_processor+0x328 >>>> > /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x12e324 >>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>>> opal_event_base_loop+0x228 >>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>>> opal_progress+0xec >>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>>> orte_plm_base_report_launched+0x1c4 >>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0: >>>> orte_plm_base_launch_apps+0x318 >>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/mca_plm_rsh.so: >>>> orte_plm_rsh_launch+0xac4 >>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:orterun+0x16a8 >>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:main+0x24 >>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:_start+0xd8 >>>> [tyr:21421] *** End of error message *** >>>> Segmentation fault >>>> tyr small_prog 142 >>>> >>>> >>>> The funny thing is that I get a segmentation fault on the Linux >>>> machine as well if I change my rankfile in the following way. 
>>>> >>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0 >>>> rank 1=linpc0.informatik.hs-fulda.de slot=0:0 >>>> #rank 2=linpc1.informatik.hs-fulda.de slot=0:0 >>>> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0 >>>> #rank 4=linpc3.informatik.hs-fulda.de slot=0:0 >>>> rank 5=linpc4.informatik.hs-fulda.de slot=0:0 >>>> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0 >>>> #rank 7=sunpc1.informatik.hs-fulda.de slot=0:0 >>>> #rank 8=sunpc2.informatik.hs-fulda.de slot=0:0 >>>> #rank 9=sunpc3.informatik.hs-fulda.de slot=0:0 >>>> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0 >>>> >>>> >>>> linpc4 small_prog 107 mpiexec -report-bindings -rf my_rankfile rank_size >>>> [linpc4:08402] [[65226,0],0] ORTE_ERROR_LOG: Data unpack would >>>> read past end of buffer in file >>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c >>>> at line 927 >>>> [linpc4:08402] *** Process received signal *** >>>> [linpc4:08402] Signal: Segmentation fault (11) >>>> [linpc4:08402] Signal code: Address not mapped (1) >>>> [linpc4:08402] Failing at address: 0x5f32fffc >>>> [linpc4:08402] [ 0] [0xffffe410] >>>> [linpc4:08402] [ 1] /usr/local/openmpi-1.6_32_cc/lib/openmpi/ >>>> mca_odls_default.so(+0x4023) [0xf73ec023] >>>> [linpc4:08402] [ 2] /usr/local/openmpi-1.6_32_cc/lib/ >>>> libopen-rte.so.4(+0x42b91) [0xf7667b91] >>>> [linpc4:08402] [ 3] /usr/local/openmpi-1.6_32_cc/lib/ >>>> libopen-rte.so.4(orte_daemon_cmd_processor+0x313) [0xf76655c3] >>>> [linpc4:08402] [ 4] /usr/local/openmpi-1.6_32_cc/lib/ >>>> libopen-rte.so.4(+0x8f366) [0xf76b4366] >>>> [linpc4:08402] [ 5] /usr/local/openmpi-1.6_32_cc/lib/ >>>> libopen-rte.so.4(opal_event_base_loop+0x18c) [0xf76b46bc] >>>> [linpc4:08402] [ 6] /usr/local/openmpi-1.6_32_cc/lib/ >>>> libopen-rte.so.4(opal_event_loop+0x26) [0xf76b4526] >>>> [linpc4:08402] [ 7] /usr/local/openmpi-1.6_32_cc/lib/ >>>> libopen-rte.so.4(opal_progress+0xba) [0xf769303a] >>>> [linpc4:08402] [ 8] /usr/local/openmpi-1.6_32_cc/lib/ >>>> libopen-rte.so.4(orte_plm_base_report_launched+0x13f) [0xf767d62f] >>>> [linpc4:08402] [ 9] /usr/local/openmpi-1.6_32_cc/lib/ >>>> libopen-rte.so.4(orte_plm_base_launch_apps+0x1b7) [0xf767bf27] >>>> [linpc4:08402] [10] /usr/local/openmpi-1.6_32_cc/lib/openmpi/ >>>> mca_plm_rsh.so(orte_plm_rsh_launch+0xb2d) [0xf74228fd] >>>> [linpc4:08402] [11] mpiexec(orterun+0x102f) [0x804e7bf] >>>> [linpc4:08402] [12] mpiexec(main+0x13) [0x804c273] >>>> [linpc4:08402] [13] /lib/libc.so.6(__libc_start_main+0xf3) [0xf745e003] >>>> [linpc4:08402] *** End of error message *** >>>> Segmentation fault >>>> linpc4 small_prog 107 >>>> >>>> >>>> Hopefully this information helps to fix the problem. >>>> >>>> >>>> Kind regards >>>> >>>> Siegmar >>>> >>>> >>>> >>>> >>>>> On Sep 5, 2012, at 5:50 AM, Siegmar Gross >>> <siegmar.gr...@informatik.hs-fulda.de> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I'm new to rankfiles so that I played a little bit with different >>>>>> options. I thought that the following entry would be similar to an >>>>>> entry in an appfile and that MPI could place the process with rank 0 >>>>>> on any core of any processor. >>>>>> >>>>>> rank 0=tyr.informatik.hs-fulda.de >>>>>> >>>>>> Unfortunately it's not allowed and I got an error. Can somebody add >>>>>> the missing help to the file? >>>>>> >>>>>> >>>>>> tyr small_prog 126 mpiexec -rf my_rankfile -report-bindings rank_size >>>>>> > -------------------------------------------------------------------------- >>>>>> Sorry! 
You were supposed to get help about: >>>>>> no-slot-list >>>>>> from the file: >>>>>> help-rmaps_rank_file.txt >>>>>> But I couldn't find that topic in the file. Sorry! >>>>>> > -------------------------------------------------------------------------- >>>>>> >>>>>> >>>>>> As you can see below I could use a rankfile on my old local machine >>>>>> (Sun Ultra 45) but not on our "new" one (Sun Server M4000). Today I >>>>>> logged into the machine via ssh and tried the same command once more >>>>>> as a local user without success. It's more or less the same error as >>>>>> before when I tried to bind the process to a remote machine. >>>>>> >>>>>> rs0 small_prog 118 mpiexec -rf my_rankfile -report-bindings rank_size >>>>>> [rs0.informatik.hs-fulda.de:13745] [[19734,0],0] odls:default:fork >>>>>> binding child [[19734,1],0] to slot_list 0:0 >>>>>> > -------------------------------------------------------------------------- >>>>>> We were unable to successfully process/set the requested processor >>>>>> affinity settings: >>>>>> >>>>>> Specified slot list: 0:0 >>>>>> Error: Cross-device link >>>>>> >>>>>> This could mean that a non-existent processor was specified, or >>>>>> that the specification had improper syntax. >>>>>> > -------------------------------------------------------------------------- >>>>>> > -------------------------------------------------------------------------- >>>>>> mpiexec was unable to start the specified application as it encountered > an >>> error: >>>>>> >>>>>> Error name: No such file or directory >>>>>> Node: rs0.informatik.hs-fulda.de >>>>>> >>>>>> when attempting to start process rank 0. >>>>>> > -------------------------------------------------------------------------- >>>>>> rs0 small_prog 119 >>>>>> >>>>>> >>>>>> The application is available. >>>>>> >>>>>> rs0 small_prog 119 which rank_size >>>>>> /home/fd1026/SunOS/sparc/bin/rank_size >>>>>> >>>>>> >>>>>> Is it a problem in the Open MPI implementation or in my rankfile? >>>>>> How can I request which sockets and cores per socket are >>>>>> available so that I can use correct values in my rankfile? >>>>>> In lam-mpi I had a command "lamnodes" which I could use to get >>>>>> such information. Thank you very much for any help in advance. >>>>>> >>>>>> >>>>>> Kind regards >>>>>> >>>>>> Siegmar >>>>>> >>>>>> >>>>>> >>>>>>>> Are *all* the machines Sparc? Or just the 3rd one (rs0)? >>>>>>> >>>>>>> Yes, both machines are Sparc. I tried first in a homogeneous >>>>>>> environment. >>>>>>> >>>>>>> tyr fd1026 106 psrinfo -v >>>>>>> Status of virtual processor 0 as of: 09/04/2012 07:32:14 >>>>>>> on-line since 08/31/2012 15:44:42. >>>>>>> The sparcv9 processor operates at 1600 MHz, >>>>>>> and has a sparcv9 floating point processor. >>>>>>> Status of virtual processor 1 as of: 09/04/2012 07:32:14 >>>>>>> on-line since 08/31/2012 15:44:39. >>>>>>> The sparcv9 processor operates at 1600 MHz, >>>>>>> and has a sparcv9 floating point processor. >>>>>>> tyr fd1026 107 >>>>>>> >>>>>>> My local machine (tyr) is a dual processor machine and the >>>>>>> other one is equipped with two quad-core processors each >>>>>>> capable of running two hardware threads. 
>>>>>>> >>>>>>> >>>>>>> Kind regards >>>>>>> >>>>>>> Siegmar >>>>>>> >>>>>>> >>>>>>>> On Sep 3, 2012, at 12:43 PM, Siegmar Gross >>>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> the man page for "mpiexec" shows the following: >>>>>>>>> >>>>>>>>> cat myrankfile >>>>>>>>> rank 0=aa slot=1:0-2 >>>>>>>>> rank 1=bb slot=0:0,1 >>>>>>>>> rank 2=cc slot=1-2 >>>>>>>>> mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out So that >>>>>>>>> >>>>>>>>> Rank 0 runs on node aa, bound to socket 1, cores 0-2. >>>>>>>>> Rank 1 runs on node bb, bound to socket 0, cores 0 and 1. >>>>>>>>> Rank 2 runs on node cc, bound to cores 1 and 2. >>>>>>>>> >>>>>>>>> Does it mean that the process with rank 0 should be bound to >>>>>>>>> core 0, 1, or 2 of socket 1? >>>>>>>>> >>>>>>>>> I tried to use a rankfile and have a problem. My rankfile contains >>>>>>>>> the following lines. >>>>>>>>> >>>>>>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0 >>>>>>>>> rank 1=tyr.informatik.hs-fulda.de slot=1:0 >>>>>>>>> #rank 2=rs0.informatik.hs-fulda.de slot=0:0 >>>>>>>>> >>>>>>>>> >>>>>>>>> Everything is fine if I use the file with just my local machine >>>>>>>>> (the first two lines). >>>>>>>>> >>>>>>>>> tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size >>>>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0] >>>>>>>>> odls:default:fork binding child [[9849,1],0] to slot_list 0:0 >>>>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0] >>>>>>>>> odls:default:fork binding child [[9849,1],1] to slot_list 1:0 >>>>>>>>> I'm process 0 of 2 available processes running on >>>>>>> tyr.informatik.hs-fulda.de. >>>>>>>>> MPI standard 2.1 is supported. >>>>>>>>> I'm process 1 of 2 available processes running on >>>>>>> tyr.informatik.hs-fulda.de. >>>>>>>>> MPI standard 2.1 is supported. >>>>>>>>> tyr small_prog 116 >>>>>>>>> >>>>>>>>> >>>>>>>>> I can also change the socket number and the processes will be attached >>>>>>>>> to the correct cores. Unfortunately it doesn't work if I add one >>>>>>>>> other machine (third line). >>>>>>>>> >>>>>>>>> >>>>>>>>> tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size >>>>>>>>> >>> -------------------------------------------------------------------------- >>>>>>>>> We were unable to successfully process/set the requested processor >>>>>>>>> affinity settings: >>>>>>>>> >>>>>>>>> Specified slot list: 0:0 >>>>>>>>> Error: Cross-device link >>>>>>>>> >>>>>>>>> This could mean that a non-existent processor was specified, or >>>>>>>>> that the specification had improper syntax. 
>>>>>>>>> >>> -------------------------------------------------------------------------- >>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] >>>>>>>>> odls:default:fork binding child [[10212,1],0] to slot_list 0:0 >>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] >>>>>>>>> odls:default:fork binding child [[10212,1],1] to slot_list 1:0 >>>>>>>>> [rs0.informatik.hs-fulda.de:12047] [[10212,0],1] >>>>>>>>> odls:default:fork binding child [[10212,1],2] to slot_list 0:0 >>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] >>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process >>>>>>>>> whose contact information is unknown in file >>>>>>>>> ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145 >>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send >>>>>>>>> to [[10212,1],0]: tag 20 >>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG: >>>>>>>>> A message is attempting to be sent to a process whose contact >>>>>>>>> information is unknown in file >>>>>>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c >>>>>>>>> at line 2501 >>>>>>>>> >>> -------------------------------------------------------------------------- >>>>>>>>> mpiexec was unable to start the specified application as it >>>>>>>>> encountered an error: >>>>>>>>> >>>>>>>>> Error name: Error 0 >>>>>>>>> Node: rs0.informatik.hs-fulda.de >>>>>>>>> >>>>>>>>> when attempting to start process rank 2. >>>>>>>>> >>> -------------------------------------------------------------------------- >>>>>>>>> tyr small_prog 113 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> The other machine has two 8 core processors. >>>>>>>>> >>>>>>>>> tyr small_prog 121 ssh rs0 psrinfo -v >>>>>>>>> Status of virtual processor 0 as of: 09/03/2012 19:51:15 >>>>>>>>> on-line since 07/26/2012 15:03:14. >>>>>>>>> The sparcv9 processor operates at 2400 MHz, >>>>>>>>> and has a sparcv9 floating point processor. >>>>>>>>> Status of virtual processor 1 as of: 09/03/2012 19:51:15 >>>>>>>>> ... >>>>>>>>> Status of virtual processor 15 as of: 09/03/2012 19:51:15 >>>>>>>>> on-line since 07/26/2012 15:03:16. >>>>>>>>> The sparcv9 processor operates at 2400 MHz, >>>>>>>>> and has a sparcv9 floating point processor. >>>>>>>>> tyr small_prog 122 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Is it necessary to specify another option on the command line or >>>>>>>>> is my rankfile faulty? Thank you very much for any suggestions in >>>>>>>>> advance. >>>>>>>>> >>>>>>>>> >>>>>>>>> Kind regards >>>>>>>>> >>>>>>>>> Siegmar >>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> users mailing list >>>>>>>>> us...@open-mpi.org >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> users mailing list >>>>>>> us...@open-mpi.org >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>> >>>>> >>> >> >> > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/