Hi,

> > are the following outputs helpful to find the error with
> > a rankfile on Solaris?
> 
> If you can't bind on the new Solaris machine, then the rankfile
> won't do you any good. It looks like we are getting the incorrect
> number of cores on that machine - is it possible that it has
> hardware threads, and doesn't report "cores"? Can you download
> and run a copy of lstopo to check the output? You get that from
> the hwloc folks:
> 
> http://www.open-mpi.org/software/hwloc/v1.5/

I downloaded and installed the package on our machines. Perhaps it is
easier to detect the error if you have more information, so I provide
the hardware architectures of all machines on which a simple program
breaks when I try to bind processes to sockets or cores.

I tried the following five commands, with "h" standing for each of
"tyr", "rs0", "linpc0", "linpc1", "linpc2", "linpc4", "sunpc0",
"sunpc1", "sunpc2", and "sunpc4", in a shell script that I started on
my local machine ("tyr"). "works on" means that the small program
(MPI_Init, printf, MPI_Finalize; see the sketch below) didn't break;
I didn't check whether the layout of the processes was correct.
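
In case it is useful, this is roughly what the test program looks
like (a minimal sketch only; the real "init_finalize" source may
differ in details such as the message it prints):

#include <stdio.h>
#include <mpi.h>

/* Minimal sketch of the test program: initialize MPI, print one
 * line, shut MPI down again. */
int main (int argc, char *argv[])
{
  MPI_Init (&argc, &argv);
  printf ("init_finalize: MPI is up and running\n");
  MPI_Finalize ();
  return 0;
}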


mpiexec -report-bindings -np 4 -host h init_finalize

works on:  tyr, rs0, linpc0, linpc1, linpc2, linpc4, sunpc0, sunpc1,
           sunpc2, sunpc4
breaks on: -


mpiexec -report-bindings -np 4 -host h -bind-to-core -bycore init_finalize

works on:  linpc2, sunpc1
breaks on: tyr, rs0, linpc0, linpc1, linpc4, sunpc0, sunpc2, sunpc4


mpiexec -report-bindings -np 4 -host h -bind-to-core -bysocket init_finalize

works on:  linpc2, sunpc1
breaks on: tyr, rs0, linpc0, linpc1, linpc4, sunpc0, sunpc2, sunpc4


mpiexec -report-bindings -np 4 -host h -bind-to-socket -bycore init_finalize

works on:  tyr, linpc1, linpc2, sunpc1, sunpc2
breaks on: rs0, linpc0, linpc4, sunpc0, sunpc4


mpiexec -report-bindings -np 4 -host h -bind-to-socket -bysocket init_finalize

works on:  tyr, linpc1, linpc2, sunpc1, sunpc2
breaks on: rs0, linpc0, linpc4, sunpc0, sunpc4



"lstopo" shows the following hardware configurations for the above
machines. The first line always shows the installed architecture.
"lstopo" does a good job as far as I can see it.

tyr:
----

UltraSPARC-IIIi, 2 single core processors, no hardware threads

tyr fd1026 183 lstopo
Machine (4096MB)
  NUMANode L#0 (P#2 2048MB) + Socket L#0 + Core L#0 + PU L#0 (P#0)
  NUMANode L#1 (P#1 2048MB) + Socket L#1 + Core L#1 + PU L#1 (P#1)

tyr fd1026 116 psrinfo -pv
The physical processor has 1 virtual processor (0)
  UltraSPARC-IIIi (portid 0 impl 0x16 ver 0x34 clock 1600 MHz)
The physical processor has 1 virtual processor (1)
  UltraSPARC-IIIi (portid 1 impl 0x16 ver 0x34 clock 1600 MHz)


rs0, rs1:
---------

SPARC64-VII, 2 quad-core processors, 2 hardware threads / core

rs0 fd1026 105 lstopo
Machine (32GB) + NUMANode L#0 (P#1 32GB)
  Socket L#0
    Core L#0
      PU L#0 (P#0)
      PU L#1 (P#1)
    Core L#1
      PU L#2 (P#2)
      PU L#3 (P#3)
    Core L#2
      PU L#4 (P#4)
      PU L#5 (P#5)
    Core L#3
      PU L#6 (P#6)
      PU L#7 (P#7)
  Socket L#1
    Core L#4
      PU L#8 (P#8)
      PU L#9 (P#9)
    Core L#5
      PU L#10 (P#10)
      PU L#11 (P#11)
    Core L#6
      PU L#12 (P#12)
      PU L#13 (P#13)
    Core L#7
      PU L#14 (P#14)
      PU L#15 (P#15)

tyr fd1026 117 ssh rs0 psrinfo -pv
The physical processor has 8 virtual processors (0-7)
  SPARC64-VII (portid 1024 impl 0x7 ver 0x91 clock 2400 MHz)
The physical processor has 8 virtual processors (8-15)
  SPARC64-VII (portid 1032 impl 0x7 ver 0x91 clock 2400 MHz)


linpc0, linpc3:
---------------

AMD Athlon64 X2, 1 dual-core processor, no hardware threads

linpc0 fd1026 102 lstopo
Machine (4023MB) + Socket L#0
  L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
  L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)


It is strange that openSuSE-Linux-12.1 reports two processor
entries, each claiming two cpu cores, although the machines are
only equipped with one dual-core processor.

linpc0 fd1026 104 cat /proc/cpuinfo  | grep -e processor -e "cpu core"
processor       : 0
cpu cores       : 2
processor       : 1
cpu cores       : 2


linpc1:
-------

Intel Xeon, 2 single core processors, no hardware threads

linpc1 fd1026 104  lstopo
Machine (3829MB)
  Socket L#0 + Core L#0 + PU L#0 (P#0)
  Socket L#1 + Core L#1 + PU L#1 (P#1)

tyr fd1026 118 ssh linpc1 cat /proc/cpuinfo | grep -e processor -e "cpu core"
processor       : 0
cpu cores       : 1
processor       : 1
cpu cores       : 1


linpc2:
-------

AMD Opteron 280, 2 dual-core processors, no hardware threads

linpc2 fd1026 103 lstopo
Machine (8190MB)
  NUMANode L#0 (P#0 4094MB) + Socket L#0
    L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#1 4096MB) + Socket L#1
    L2 L#2 (1024KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (1024KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)

It is strange that openSuSE-Linux-12.1 reports four processor
entries, each claiming two cpu cores, although the machine is
only equipped with two dual-core processors.

linpc2 fd1026 104 cat /proc/cpuinfo | grep -e processor -e "cpu core"
processor       : 0
cpu cores       : 2
processor       : 1
cpu cores       : 2
processor       : 2
cpu cores       : 2
processor       : 3
cpu cores       : 2



linpc4:
-------

AMD Opteron 1218, 1 dual-core processor, no hardware threads

linpc4 fd1026 100 lstopo
Machine (4024MB) + Socket L#0
  L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
  L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)

It is strange that openSuSE-Linux-12.1 reports two processor
entries, each claiming two cpu cores, although the machine is
only equipped with one dual-core processor.

tyr fd1026 230 ssh linpc4 cat /proc/cpuinfo | grep -e processor -e "cpu core"
processor       : 0
cpu cores       : 2
processor       : 1
cpu cores       : 2



sunpc0, sunpc3:
---------------

AMD Athlon64 X2, 1 dual-core processor, no hardware threads

sunpc0 fd1026 104 lstopo
Machine (4094MB) + NUMANode L#0 (P#0 4094MB) + Socket L#0
  Core L#0 + PU L#0 (P#0)
  Core L#1 + PU L#1 (P#1)

tyr fd1026 111 ssh sunpc0 psrinfo -pv
The physical processor has 2 virtual processors (0 1)
  x86 (chipid 0x0 AuthenticAMD family 15 model 43 step 1 clock 2000 MHz)
        AMD Athlon(tm) 64 X2 Dual Core Processor 3800+


sunpc1:
-------

AMD Opteron 280, 2 dual-core processors, no hardware threads

sunpc1 fd1026 104 lstopo
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)

tyr fd1026 112 ssh sunpc1 psrinfo -pv
The physical processor has 2 virtual processors (0 1)
  x86 (chipid 0x0 AuthenticAMD family 15 model 33 step 2 clock 2411 MHz)
        Dual Core AMD Opteron(tm) Processor 280
The physical processor has 2 virtual processors (2 3)
  x86 (chipid 0x1 AuthenticAMD family 15 model 33 step 2 clock 2411 MHz)
        Dual Core AMD Opteron(tm) Processor 280


sunpc2:
-------

Intel Xeon, 2 single core processors, no hardware threads

sunpc2 fd1026 104 lstopo
Machine (3904MB) + NUMANode L#0 (P#0 3904MB)
  Socket L#0 + Core L#0 + PU L#0 (P#0)
  Socket L#1 + Core L#1 + PU L#1 (P#1)

tyr fd1026 114 ssh sunpc2 psrinfo -pv
The physical processor has 1 virtual processor (0)
  x86 (chipid 0x0 GenuineIntel family 15 model 2 step 9 clock 2791 MHz)
        Intel(r) Xeon(tm) CPU 2.80GHz
The physical processor has 1 virtual processor (1)
  x86 (chipid 0x3 GenuineIntel family 15 model 2 step 9 clock 2791 MHz)
        Intel(r) Xeon(tm) CPU 2.80GHz


sunpc4:
-------

AMD Opteron 1218, 1 dual-core processor, no hardware threads

sunpc4 fd1026 104 lstopo
Machine (4096MB) + NUMANode L#0 (P#0 4096MB) + Socket L#0
  Core L#0 + PU L#0 (P#0)
  Core L#1 + PU L#1 (P#1)

tyr fd1026 115 ssh sunpc4 psrinfo -pv
The physical processor has 2 virtual processors (0 1)
  x86 (chipid 0x0 AuthenticAMD family 15 model 67 step 2 clock 2613 MHz)
        Dual-Core AMD Opteron(tm) Processor 1218




Among others, I got the following error messages (I can provide the
complete file if you are interested).

##################
##################
mpiexec -report-bindings -np 4 -host tyr -bind-to-core -bycore init_finalize
[tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding child [[30908,1],2] to cpus 0004
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding child [[30908,1],0] to cpus 0001
[tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding child [[30908,1],1] to cpus 0002
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error
on node tyr.informatik.hs-fulda.de. More information may be available above.
--------------------------------------------------------------------------
4 total processes failed to start


##################
##################
mpiexec -report-bindings -np 4 -host tyr -bind-to-core -bysocket init_finalize
--------------------------------------------------------------------------
An invalid physical processor ID was returned when attempting to bind
an MPI process to a unique processor.

This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors, where N >
M).  Double check that you have enough unique processors for all the
MPI processes that you are launching on this host.

You job will now abort.
--------------------------------------------------------------------------
[tyr.informatik.hs-fulda.de:23215] [[30907,0],0] odls:default:fork binding child [[30907,1],0] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:23215] [[30907,0],0] odls:default:fork binding child [[30907,1],1] to socket 1 cpus 0002
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error
on node tyr.informatik.hs-fulda.de. More information may be available above.
--------------------------------------------------------------------------
4 total processes failed to start


##################
##################
mpiexec -report-bindings -np 4 -host rs0 -bind-to-core -bycore init_finalize
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[rs0.informatik.hs-fulda.de:05715] [[30936,0],1] odls:default:fork binding child [[30936,1],0] to cpus 0001
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error:

Error name: Resource temporarily unavailable
Node: rs0

when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start


##################
##################
mpiexec -report-bindings -np 4 -host rs0 -bind-to-core -bysocket init_finalize
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[rs0.informatik.hs-fulda.de:05743] [[30916,0],1] odls:default:fork binding child [[30916,1],0] to socket 0 cpus 0001
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error:

Error name: Resource temporarily unavailable
Node: rs0

when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start


##################
##################
mpiexec -report-bindings -np 4 -host rs0 -bind-to-socket -bycore init_finalize
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[rs0.informatik.hs-fulda.de:05771] [[30912,0],1] odls:default:fork binding child [[30912,1],0] to socket 0 cpus 0055
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error:

Error name: Resource temporarily unavailable
Node: rs0

when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start


##################
##################
mpiexec -report-bindings -np 4 -host rs0 -bind-to-socket -bysocket init_finalize
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[rs0.informatik.hs-fulda.de:05799] [[30924,0],1] odls:default:fork binding child [[30924,1],0] to socket 0 cpus 0055
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error:

Error name: Resource temporarily unavailable
Node: rs0

when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start


##################
##################
mpiexec -report-bindings -np 4 -host linpc0 -bind-to-core -bycore init_finalize
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],0] to cpus 0001
[linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],1] to cpus 0002
[linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],2] to cpus 0004
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error
on node linpc0. More information may be available above.
--------------------------------------------------------------------------
4 total processes failed to start


##################
##################
mpiexec -report-bindings -np 4 -host linpc0 -bind-to-core -bysocket init_finalize
--------------------------------------------------------------------------
An invalid physical processor ID was returned when attempting to bind
an MPI process to a unique processor.

This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors, where N >
M).  Double check that you have enough unique processors for all the
MPI processes that you are launching on this host.

You job will now abort.
--------------------------------------------------------------------------
[linpc0:02326] [[30960,0],1] odls:default:fork binding child [[30960,1],0] to socket 0 cpus 0001
[linpc0:02326] [[30960,0],1] odls:default:fork binding child [[30960,1],1] to socket 0 cpus 0002
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error
on node linpc0. More information may be available above.
--------------------------------------------------------------------------
4 total processes failed to start


##################
##################
mpiexec -report-bindings -np 4 -host linpc0 -bind-to-socket -bycore init_finalize
--------------------------------------------------------------------------
Unable to bind to socket 0 on node linpc0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error:

Error name: Fatal
Node: linpc0

when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start


##################
##################
mpiexec -report-bindings -np 4 -host linpc0 -bind-to-socket -bysocket init_finalize
--------------------------------------------------------------------------
Unable to bind to socket 0 on node linpc0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error:

Error name: Fatal
Node: linpc0

when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start



Hopefully this helps to track down the error. Thank you very much in
advance for your help.


Kind regards

Siegmar



> > I wrapped long lines so that they
> > are easier to read. Have you had time to look at the
> > segmentation fault with a rankfile which I reported in my
> > last email (see below)?
> 
> I'm afraid not - been too busy lately. I'd suggest first focusing
> on getting binding to work.
> 
> > 
> > "tyr" is a two processor single core machine.
> > 
> > tyr fd1026 116 mpiexec -report-bindings -np 4 \
> >  -bind-to-socket -bycore rank_size
> > [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
> >  fork binding child [[27298,1],0] to socket 0 cpus 0001
> > [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
> >  fork binding child [[27298,1],1] to socket 1 cpus 0002
> > [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
> >  fork binding child [[27298,1],2] to socket 0 cpus 0001
> > [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
> >  fork binding child [[27298,1],3] to socket 1 cpus 0002
> > I'm process 0 of 4 ...
> > 
> > 
> > tyr fd1026 121 mpiexec -report-bindings -np 4 \
> > -bind-to-socket -bysocket rank_size
> > [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
> >  fork binding child [[27380,1],0] to socket 0 cpus 0001
> > [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
> >  fork binding child [[27380,1],1] to socket 1 cpus 0002
> > [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
> >  fork binding child [[27380,1],2] to socket 0 cpus 0001
> > [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
> >  fork binding child [[27380,1],3] to socket 1 cpus 0002
> > I'm process 0 of 4 ...
> > 
> > 
> > tyr fd1026 117 mpiexec -report-bindings -np 4 \
> >  -bind-to-core -bycore rank_size
> > [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
> >  fork binding child [[27307,1],2] to cpus 0004
> > ------------------------------------------------------------------
> > An attempt to set processor affinity has failed - please check to
> > ensure that your system supports such functionality. If so, then
> > this is probably something that should be reported to the OMPI
> >  developers.
> > ------------------------------------------------------------------
> > [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
> >  fork binding child [[27307,1],0] to cpus 0001
> > [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
> >  fork binding child [[27307,1],1] to cpus 0002
> > ------------------------------------------------------------------
> > mpiexec was unable to start the specified application
> >  as it encountered an error
> > on node tyr.informatik.hs-fulda.de. More information may be
> >  available above.
> > ------------------------------------------------------------------
> > 4 total processes failed to start
> > 
> > 
> > 
> > tyr fd1026 118 mpiexec -report-bindings -np 4 \
> >  -bind-to-core -bysocket rank_size
> > ------------------------------------------------------------------
> > An invalid physical processor ID was returned when attempting to
> >  bind
> > an MPI process to a unique processor.
> > 
> > This usually means that you requested binding to more processors
> >  than
> > 
> > exist (e.g., trying to bind N MPI processes to M processors,
> >  where N >
> > M).  Double check that you have enough unique processors for
> >  all the
> > MPI processes that you are launching on this host.
> > 
> > You job will now abort.
> > ------------------------------------------------------------------
> > [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
> >  fork binding child [[27347,1],0] to socket 0 cpus 0001
> > [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
> >  fork binding child [[27347,1],1] to socket 1 cpus 0002
> > ------------------------------------------------------------------
> > mpiexec was unable to start the specified application as it
> >  encountered an error
> > on node tyr.informatik.hs-fulda.de. More information may be
> >  available above.
> > ------------------------------------------------------------------
> > 4 total processes failed to start
> > tyr fd1026 119 
> > 
> > 
> > 
> > "linpc3" and "linpc4" are two processor dual core machines.
> > 
> > linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
> > -np 4 -bind-to-core -bycore rank_size
> > [linpc4:16842] [[40914,0],0] odls:default:
> >  fork binding child [[40914,1],1] to cpus 0001
> > [linpc4:16842] [[40914,0],0] odls:default:
> >  fork binding child [[40914,1],3] to cpus 0002
> > [linpc3:31384] [[40914,0],1] odls:default:
> >  fork binding child [[40914,1],0] to cpus 0001
> > [linpc3:31384] [[40914,0],1] odls:default:
> >  fork binding child [[40914,1],2] to cpus 0002
> > I'm process 1 of 4 ...
> > 
> > 
> > linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
> >  -np 4 -bind-to-core -bysocket rank_size
> > [linpc4:16846] [[40918,0],0] odls:default:
> >  fork binding child [[40918,1],1] to socket 0 cpus 0001
> > [linpc4:16846] [[40918,0],0] odls:default:
> >  fork binding child [[40918,1],3] to socket 0 cpus 0002
> > [linpc3:31435] [[40918,0],1] odls:default:
> >  fork binding child [[40918,1],0] to socket 0 cpus 0001
> > [linpc3:31435] [[40918,0],1] odls:default:
> >  fork binding child [[40918,1],2] to socket 0 cpus 0002
> > I'm process 1 of 4 ...
> > 
> > 
> > 
> > 
> > linpc4 fd1026 104 mpiexec -report-bindings -host linpc3,linpc4 \
> >  -np 4 -bind-to-socket -bycore rank_size
> > ------------------------------------------------------------------
> > Unable to bind to socket 0 on node linpc3.
> > ------------------------------------------------------------------
> > ------------------------------------------------------------------
> > Unable to bind to socket 0 on node linpc4.
> > ------------------------------------------------------------------
> > ------------------------------------------------------------------
> > mpiexec was unable to start the specified application as it
> >  encountered an error:
> > 
> > Error name: Fatal
> > Node: linpc4
> > 
> > when attempting to start process rank 1.
> > ------------------------------------------------------------------
> > 4 total processes failed to start
> > linpc4 fd1026 105 
> > 
> > 
> > linpc4 fd1026 105 mpiexec -report-bindings -host linpc3,linpc4 \
> >  -np 4 -bind-to-socket -bysocket rank_size
> > ------------------------------------------------------------------
> > Unable to bind to socket 0 on node linpc4.
> > ------------------------------------------------------------------
> > ------------------------------------------------------------------
> > Unable to bind to socket 0 on node linpc3.
> > ------------------------------------------------------------------
> > ------------------------------------------------------------------
> > mpiexec was unable to start the specified application as it
> >  encountered an error:
> > 
> > Error name: Fatal
> > Node: linpc4
> > 
> > when attempting to start process rank 1.
> > --------------------------------------------------------------------------
> > 4 total processes failed to start
> > 
> > 
> > It's interesting that commands that work on Solaris fail on Linux
> > and vice versa.
> > 
> > 
> > Kind regards
> > 
> > Siegmar
> > 
> >>> I couldn't really say for certain - I don't see anything obviously
> >>> wrong with your syntax, and the code appears to be working or else
> >>> it would fail on the other nodes as well. The fact that it fails
> >>> solely on that machine seems suspect.
> >>> 
> >>> Set aside the rankfile for the moment and try to just bind to cores
> >>> on that machine, something like:
> >>> 
> >>> mpiexec --report-bindings -bind-to-core
> >>>  -host rs0.informatik.hs-fulda.de -n 2 rank_size
> >>> 
> >>> If that doesn't work, then the problem isn't with rankfile
> >> 
> >> It doesn't work but I found out something else as you can see below.
> >> I get a segmentation fault for some rankfiles.
> >> 
> >> 
> >> tyr small_prog 110 mpiexec --report-bindings -bind-to-core
> >>  -host rs0.informatik.hs-fulda.de -n 2 rank_size
> >> --------------------------------------------------------------------------
> >> An attempt to set processor affinity has failed - please check to
> >> ensure that your system supports such functionality. If so, then
> >> this is probably something that should be reported to the OMPI developers.
> >> --------------------------------------------------------------------------
> >> [rs0.informatik.hs-fulda.de:14695] [[30561,0],1] odls:default:
> >>  fork binding child [[30561,1],0] to cpus 0001
> >> --------------------------------------------------------------------------
> >> mpiexec was unable to start the specified application as it
> >>  encountered an error:
> >> 
> >> Error name: Resource temporarily unavailable
> >> Node: rs0.informatik.hs-fulda.de
> >> 
> >> when attempting to start process rank 0.
> >> --------------------------------------------------------------------------
> >> 2 total processes failed to start
> >> tyr small_prog 111 
> >> 
> >> 
> >> 
> >> 
> >> Perhaps I have a hint for the error on Solaris Sparc. I use the
> >> following rankfile to keep everything simple.
> >> 
> >> rank 0=tyr.informatik.hs-fulda.de slot=0:0
> >> rank 1=linpc0.informatik.hs-fulda.de slot=0:0
> >> rank 2=linpc1.informatik.hs-fulda.de slot=0:0
> >> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0
> >> rank 4=linpc3.informatik.hs-fulda.de slot=0:0
> >> rank 5=linpc4.informatik.hs-fulda.de slot=0:0
> >> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0
> >> rank 7=sunpc1.informatik.hs-fulda.de slot=0:0
> >> rank 8=sunpc2.informatik.hs-fulda.de slot=0:0
> >> rank 9=sunpc3.informatik.hs-fulda.de slot=0:0
> >> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0
> >> 
> >> When I execute "mpiexec -report-bindings -rf my_rankfile rank_size"
> >> on a Linux-x86_64 or Solaris-10-x86_64 machine everything works fine.
> >> 
> >> linpc4 small_prog 104 mpiexec -report-bindings -rf my_rankfile rank_size
> >> [linpc4:08018] [[49482,0],0] odls:default:fork binding child
> >>  [[49482,1],5] to slot_list 0:0
> >> [linpc3:22030] [[49482,0],4] odls:default:fork binding child
> >>  [[49482,1],4] to slot_list 0:0
> >> [linpc0:12887] [[49482,0],2] odls:default:fork binding child
> >>  [[49482,1],1] to slot_list 0:0
> >> [linpc1:08323] [[49482,0],3] odls:default:fork binding child
> >>  [[49482,1],2] to slot_list 0:0
> >> [sunpc1:17786] [[49482,0],6] odls:default:fork binding child
> >>  [[49482,1],7] to slot_list 0:0
> >> [sunpc3.informatik.hs-fulda.de:08482] [[49482,0],8] odls:default:fork
> >>  binding child [[49482,1],9] to slot_list 0:0
> >> [sunpc0.informatik.hs-fulda.de:11568] [[49482,0],5] odls:default:fork
> >>  binding child [[49482,1],6] to slot_list 0:0
> >> [tyr.informatik.hs-fulda.de:21484] [[49482,0],1] odls:default:fork
> >>  binding child [[49482,1],0] to slot_list 0:0
> >> [sunpc2.informatik.hs-fulda.de:28638] [[49482,0],7] odls:default:fork
> >>  binding child [[49482,1],8] to slot_list 0:0
> >> ...
> >> 
> >> 
> >> 
> >> I get a segmentation fault when I run it on my local machine
> >> (Solaris Sparc).
> >> 
> >> tyr small_prog 141 mpiexec -report-bindings -rf my_rankfile rank_size
> >> [tyr.informatik.hs-fulda.de:21421] [[29113,0],0] ORTE_ERROR_LOG:
> >>  Data unpack would read past end of buffer in file
> >>  ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
> >>  at line 927
> >> [tyr:21421] *** Process received signal ***
> >> [tyr:21421] Signal: Segmentation Fault (11)
> >> [tyr:21421] Signal code: Address not mapped (1)
> >> [tyr:21421] Failing at address: 5ba
> >> 
/export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x15d3ec
> >> /lib/libc.so.1:0xcad04
> >> /lib/libc.so.1:0xbf3b4
> >> /lib/libc.so.1:0xbf59c
> >> /lib/libc.so.1:0x58bd0 [ Signal 11 (SEGV)]
> >> /lib/libc.so.1:free+0x24
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >>  orte_odls_base_default_construct_child_list+0x1234
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/
> >>  mca_odls_default.so:0x90b8
> >> 
/export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x5e8d4
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >>  orte_daemon_cmd_processor+0x328
> >> 
/export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x12e324
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >>  opal_event_base_loop+0x228
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >>  opal_progress+0xec
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >>  orte_plm_base_report_launched+0x1c4
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >>  orte_plm_base_launch_apps+0x318
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/mca_plm_rsh.so:
> >>  orte_plm_rsh_launch+0xac4
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:orterun+0x16a8
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:main+0x24
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:_start+0xd8
> >> [tyr:21421] *** End of error message ***
> >> Segmentation fault
> >> tyr small_prog 142 
> >> 
> >> 
> >> The funny thing is that I get a segmentation fault on the Linux
> >> machine as well if I change my rankfile in the following way.
> >> 
> >> rank 0=tyr.informatik.hs-fulda.de slot=0:0
> >> rank 1=linpc0.informatik.hs-fulda.de slot=0:0
> >> #rank 2=linpc1.informatik.hs-fulda.de slot=0:0
> >> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0
> >> #rank 4=linpc3.informatik.hs-fulda.de slot=0:0
> >> rank 5=linpc4.informatik.hs-fulda.de slot=0:0
> >> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0
> >> #rank 7=sunpc1.informatik.hs-fulda.de slot=0:0
> >> #rank 8=sunpc2.informatik.hs-fulda.de slot=0:0
> >> #rank 9=sunpc3.informatik.hs-fulda.de slot=0:0
> >> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0
> >> 
> >> 
> >> linpc4 small_prog 107 mpiexec -report-bindings -rf my_rankfile rank_size
> >> [linpc4:08402] [[65226,0],0] ORTE_ERROR_LOG: Data unpack would
> >>  read past end of buffer in file 
> >>  ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
> >>  at line 927
> >> [linpc4:08402] *** Process received signal ***
> >> [linpc4:08402] Signal: Segmentation fault (11)
> >> [linpc4:08402] Signal code: Address not mapped (1)
> >> [linpc4:08402] Failing at address: 0x5f32fffc
> >> [linpc4:08402] [ 0] [0xffffe410]
> >> [linpc4:08402] [ 1] /usr/local/openmpi-1.6_32_cc/lib/openmpi/
> >>  mca_odls_default.so(+0x4023) [0xf73ec023]
> >> [linpc4:08402] [ 2] /usr/local/openmpi-1.6_32_cc/lib/
> >>  libopen-rte.so.4(+0x42b91) [0xf7667b91]
> >> [linpc4:08402] [ 3] /usr/local/openmpi-1.6_32_cc/lib/
> >>  libopen-rte.so.4(orte_daemon_cmd_processor+0x313) [0xf76655c3]
> >> [linpc4:08402] [ 4] /usr/local/openmpi-1.6_32_cc/lib/
> >>  libopen-rte.so.4(+0x8f366) [0xf76b4366]
> >> [linpc4:08402] [ 5] /usr/local/openmpi-1.6_32_cc/lib/
> >>  libopen-rte.so.4(opal_event_base_loop+0x18c) [0xf76b46bc]
> >> [linpc4:08402] [ 6] /usr/local/openmpi-1.6_32_cc/lib/
> >>  libopen-rte.so.4(opal_event_loop+0x26) [0xf76b4526]
> >> [linpc4:08402] [ 7] /usr/local/openmpi-1.6_32_cc/lib/
> >>  libopen-rte.so.4(opal_progress+0xba) [0xf769303a]
> >> [linpc4:08402] [ 8] /usr/local/openmpi-1.6_32_cc/lib/
> >>  libopen-rte.so.4(orte_plm_base_report_launched+0x13f) [0xf767d62f]
> >> [linpc4:08402] [ 9] /usr/local/openmpi-1.6_32_cc/lib/
> >>  libopen-rte.so.4(orte_plm_base_launch_apps+0x1b7) [0xf767bf27]
> >> [linpc4:08402] [10] /usr/local/openmpi-1.6_32_cc/lib/openmpi/
> >>  mca_plm_rsh.so(orte_plm_rsh_launch+0xb2d) [0xf74228fd]
> >> [linpc4:08402] [11] mpiexec(orterun+0x102f) [0x804e7bf]
> >> [linpc4:08402] [12] mpiexec(main+0x13) [0x804c273]
> >> [linpc4:08402] [13] /lib/libc.so.6(__libc_start_main+0xf3) [0xf745e003]
> >> [linpc4:08402] *** End of error message ***
> >> Segmentation fault
> >> linpc4 small_prog 107 
> >> 
> >> 
> >> Hopefully this information helps to fix the problem.
> >> 
> >> 
> >> Kind regards
> >> 
> >> Siegmar
> >> 
> >> 
> >> 
> >> 
> >>> On Sep 5, 2012, at 5:50 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >>> 
> >>>> Hi,
> >>>> 
> >>>> I'm new to rankfiles so that I played a little bit with different
> >>>> options. I thought that the following entry would be similar to an
> >>>> entry in an appfile and that MPI could place the process with rank 0
> >>>> on any core of any processor.
> >>>> 
> >>>> rank 0=tyr.informatik.hs-fulda.de
> >>>> 
> >>>> Unfortunately it's not allowed and I got an error. Can somebody add
> >>>> the missing help to the file?
> >>>> 
> >>>> 
> >>>> tyr small_prog 126 mpiexec -rf my_rankfile -report-bindings rank_size
> >>>> 
--------------------------------------------------------------------------
> >>>> Sorry!  You were supposed to get help about:
> >>>>   no-slot-list
> >>>> from the file:
> >>>>   help-rmaps_rank_file.txt
> >>>> But I couldn't find that topic in the file.  Sorry!
> >>>> 
--------------------------------------------------------------------------
> >>>> 
> >>>> 
> >>>> As you can see below I could use a rankfile on my old local machine
> >>>> (Sun Ultra 45) but not on our "new" one (Sun Server M4000). Today I
> >>>> logged into the machine via ssh and tried the same command once more
> >>>> as a local user without success. It's more or less the same error as
> >>>> before when I tried to bind the process to a remote machine.
> >>>> 
> >>>> rs0 small_prog 118 mpiexec -rf my_rankfile -report-bindings rank_size
> >>>> [rs0.informatik.hs-fulda.de:13745] [[19734,0],0] odls:default:fork
> >>>> binding child [[19734,1],0] to slot_list 0:0
> >>>> 
--------------------------------------------------------------------------
> >>>> We were unable to successfully process/set the requested processor
> >>>> affinity settings:
> >>>> 
> >>>> Specified slot list: 0:0
> >>>> Error: Cross-device link
> >>>> 
> >>>> This could mean that a non-existent processor was specified, or
> >>>> that the specification had improper syntax.
> >>>> 
--------------------------------------------------------------------------
> >>>> 
--------------------------------------------------------------------------
> >>>> mpiexec was unable to start the specified application as it encountered an error:
> >>>> 
> >>>> Error name: No such file or directory
> >>>> Node: rs0.informatik.hs-fulda.de
> >>>> 
> >>>> when attempting to start process rank 0.
> >>>> 
--------------------------------------------------------------------------
> >>>> rs0 small_prog 119 
> >>>> 
> >>>> 
> >>>> The application is available.
> >>>> 
> >>>> rs0 small_prog 119 which rank_size
> >>>> /home/fd1026/SunOS/sparc/bin/rank_size
> >>>> 
> >>>> 
> >>>> Is it a problem in the Open MPI implementation or in my rankfile?
> >>>> How can I request which sockets and cores per socket are
> >>>> available so that I can use correct values in my rankfile?
> >>>> In lam-mpi I had a command "lamnodes" which I could use to get
> >>>> such information. Thank you very much for any help in advance.
> >>>> 
> >>>> 
> >>>> Kind regards
> >>>> 
> >>>> Siegmar
> >>>> 
> >>>> 
> >>>> 
> >>>>>> Are *all* the machines Sparc? Or just the 3rd one (rs0)?
> >>>>> 
> >>>>> Yes, both machines are Sparc. I tried first in a homogeneous
> >>>>> environment.
> >>>>> 
> >>>>> tyr fd1026 106 psrinfo -v
> >>>>> Status of virtual processor 0 as of: 09/04/2012 07:32:14
> >>>>> on-line since 08/31/2012 15:44:42.
> >>>>> The sparcv9 processor operates at 1600 MHz,
> >>>>>       and has a sparcv9 floating point processor.
> >>>>> Status of virtual processor 1 as of: 09/04/2012 07:32:14
> >>>>> on-line since 08/31/2012 15:44:39.
> >>>>> The sparcv9 processor operates at 1600 MHz,
> >>>>>       and has a sparcv9 floating point processor.
> >>>>> tyr fd1026 107 
> >>>>> 
> >>>>> My local machine (tyr) is a dual processor machine and the
> >>>>> other one is equipped with two quad-core processors each
> >>>>> capable of running two hardware threads.
> >>>>> 
> >>>>> 
> >>>>> Kind regards
> >>>>> 
> >>>>> Siegmar
> >>>>> 
> >>>>> 
> >>>>>> On Sep 3, 2012, at 12:43 PM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >>>>>> 
> >>>>>>> Hi,
> >>>>>>> 
> >>>>>>> the man page for "mpiexec" shows the following:
> >>>>>>> 
> >>>>>>>       cat myrankfile
> >>>>>>>       rank 0=aa slot=1:0-2
> >>>>>>>       rank 1=bb slot=0:0,1
> >>>>>>>       rank 2=cc slot=1-2
> >>>>>>>       mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out So that
> >>>>>>> 
> >>>>>>>     Rank 0 runs on node aa, bound to socket 1, cores 0-2.
> >>>>>>>     Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
> >>>>>>>     Rank 2 runs on node cc, bound to cores 1 and 2.
> >>>>>>> 
> >>>>>>> Does it mean that the process with rank 0 should be bound to
> >>>>>>> core 0, 1, or 2 of socket 1?
> >>>>>>> 
> >>>>>>> I tried to use a rankfile and have a problem. My rankfile contains
> >>>>>>> the following lines.
> >>>>>>> 
> >>>>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0
> >>>>>>> rank 1=tyr.informatik.hs-fulda.de slot=1:0
> >>>>>>> #rank 2=rs0.informatik.hs-fulda.de slot=0:0
> >>>>>>> 
> >>>>>>> 
> >>>>>>> Everything is fine if I use the file with just my local machine
> >>>>>>> (the first two lines).
> >>>>>>> 
> >>>>>>> tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
> >>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
> >>>>>>> odls:default:fork binding child [[9849,1],0] to slot_list 0:0
> >>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
> >>>>>>> odls:default:fork binding child [[9849,1],1] to slot_list 1:0
> >>>>>>> I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
> >>>>>>> MPI standard 2.1 is supported.
> >>>>>>> I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
> >>>>>>> MPI standard 2.1 is supported.
> >>>>>>> tyr small_prog 116 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> I can also change the socket number and the processes will be attached
> >>>>>>> to the correct cores. Unfortunately it doesn't work if I add one
> >>>>>>> other machine (third line).
> >>>>>>> 
> >>>>>>> 
> >>>>>>> tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> We were unable to successfully process/set the requested processor
> >>>>>>> affinity settings:
> >>>>>>> 
> >>>>>>> Specified slot list: 0:0
> >>>>>>> Error: Cross-device link
> >>>>>>> 
> >>>>>>> This could mean that a non-existent processor was specified, or
> >>>>>>> that the specification had improper syntax.
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> >>>>>>> odls:default:fork binding child [[10212,1],0] to slot_list 0:0
> >>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> >>>>>>> odls:default:fork binding child [[10212,1],1] to slot_list 1:0
> >>>>>>> [rs0.informatik.hs-fulda.de:12047] [[10212,0],1]
> >>>>>>> odls:default:fork binding child [[10212,1],2] to slot_list 0:0
> >>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> >>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process
> >>>>>>> whose contact information is unknown in file
> >>>>>>> ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
> >>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send
> >>>>>>> to [[10212,1],0]: tag 20
> >>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG:
> >>>>>>> A message is attempting to be sent to a process whose contact
> >>>>>>> information is unknown in file
> >>>>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
> >>>>>>> at line 2501
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> mpiexec was unable to start the specified application as it
> >>>>>>> encountered an error:
> >>>>>>> 
> >>>>>>> Error name: Error 0
> >>>>>>> Node: rs0.informatik.hs-fulda.de
> >>>>>>> 
> >>>>>>> when attempting to start process rank 2.
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> tyr small_prog 113 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> The other machine has two 8 core processors.
> >>>>>>> 
> >>>>>>> tyr small_prog 121 ssh rs0 psrinfo -v
> >>>>>>> Status of virtual processor 0 as of: 09/03/2012 19:51:15
> >>>>>>> on-line since 07/26/2012 15:03:14.
> >>>>>>> The sparcv9 processor operates at 2400 MHz,
> >>>>>>>      and has a sparcv9 floating point processor.
> >>>>>>> Status of virtual processor 1 as of: 09/03/2012 19:51:15
> >>>>>>> ...
> >>>>>>> Status of virtual processor 15 as of: 09/03/2012 19:51:15
> >>>>>>> on-line since 07/26/2012 15:03:16.
> >>>>>>> The sparcv9 processor operates at 2400 MHz,
> >>>>>>>      and has a sparcv9 floating point processor.
> >>>>>>> tyr small_prog 122 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> Is it necessary to specify another option on the command line or
> >>>>>>> is my rankfile faulty? Thank you very much for any suggestions in
> >>>>>>> advance.
> >>>>>>> 
> >>>>>>> 
> >>>>>>> Kind regards
> >>>>>>> 
> >>>>>>> Siegmar
> >>>>>>> 
> >>>>>>> 
> >>>>>>> _______________________________________________
> >>>>>>> users mailing list
> >>>>>>> us...@open-mpi.org
> >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>> 
> >>>>>> 
> >>>>> 
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> us...@open-mpi.org
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>> 
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> us...@open-mpi.org
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>> 
> >>> 
> > 
> 
> 
