Hello,

I'm playing with a copy of svn7132 that built and installed just fine. At 
first everything seemed OK; unlike earlier revisions, it now runs on mvapi 
automagically :-)

But then a small test program failed, and then another. After scratching my 
head for a while, I realised the pattern: as soon as two ranks shared one 
node and I used "mpi_leave_pinned 1", it broke (segfaulted).

Here is a bidirectional point-to-point test running two ranks on the same host 
(this one actually starts but segfaults halfway through):

NODEFILE is "n50 n50"
[cap@n50 mpi]$ mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1 
--np 2 mpibibench.ompi7132
Using Zero pattern.
starting _bidirect_ lat-bw test.
Latency:   1.8 µsec   (total)Bandwidth:   0.0 bytes/s (0 x 10000)
Latency:   2.0 µsec   (total)Bandwidth:   1.0 Mbytes/s (1 x 10000)
Latency:   2.0 µsec   (total)Bandwidth:   2.0 Mbytes/s (2 x 10000)
Latency:   1.9 µsec   (total)Bandwidth:   4.2 Mbytes/s (4 x 10000)
Latency:   2.0 µsec   (total)Bandwidth:   8.1 Mbytes/s (8 x 10000)
Latency:   2.2 µsec   (total)Bandwidth:  14.8 Mbytes/s (16 x 10000)
Latency:   2.0 µsec   (total)Bandwidth:  31.7 Mbytes/s (32 x 10000)
Latency:   2.2 µsec   (total)Bandwidth:  57.3 Mbytes/s (64 x 10000)
Latency:   2.2 µsec   (total)Bandwidth: 114.3 Mbytes/s (128 x 10000)
Latency:   2.3 µsec   (total)Bandwidth: 224.8 Mbytes/s (256 x 10000)
Latency:   2.8 µsec   (total)Bandwidth: 369.8 Mbytes/s (512 x 10000)
mpirun noticed that job rank 0 with PID 5879 on node "n50" exited on signal 
11.
1 additional process aborted (not shown)

From dmesg:
mpibibench.ompi[5879]: segfault at 0000000000000000 rip 0000000000000000 rsp 
0000007fbfffe8e8 error 14

Running on more than one node seems to die instantly (a simple all-to-all app):

NODEFILE is "n50 n50 n49 n49"
[cap@n50 mpi]$ mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1 
--np 4 alltoall.ompi7132
mpirun noticed that job rank 3 with PID 27857 on node "n49" exited on signal 
11.
3 additional processes aborted (not shown)

and with a similar segfault in dmesg.

Either running with one proc per node or skipping mpi_leave_pinned makes it 
work 100%. Is this expected?
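For reference, both workarounds as I run them here (same program and shell 
variables as above; exact paths are of course site-specific):

```shell
# Workaround 1: drop the mpi_leave_pinned MCA parameter (runs clean)
mpirun --machinefile $PBS_NODEFILE --np 2 mpibibench.ompi7132

# Workaround 2: keep mpi_leave_pinned but give each rank its own node,
# i.e. NODEFILE contains "n50 n49" instead of "n50 n50"
mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1 --np 2 mpibibench.ompi7132
```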

tia,
 Peter

System config:
OS: centos-4.1 x86_64 2.6.9-11smp (el4u1)
ompi: svn7132 vpath build with recommended libtool/autoconf/automake
compilers: 64-bit icc/ifort 8.1-029
configure: ./configure --prefix=xxx --with-btl-mvapi=yyy --disable-cxx 
--disable-f90 --disable-io-romio

-- 
------------------------------------------------------------
  Peter Kjellström               |
  National Supercomputer Centre  |
  Sweden                         | http://www.nsc.liu.se
