[O-MPI users] Problems running svn7132 with more than one proc per node

2005-09-02 Thread Peter Kjellström
Hello,

I'm playing with a copy of svn7132 that built and installed just fine. At 
first everything seemed ok, unlike earlier it now runs on mvapi 
automagically :-)

But then a small test program failed, and then another. After scratching my 
head for a while, I realised the pattern: as soon as two ranks shared one node 
and I used "mpi_leave_pinned 1", it broke (segfaulted).

Here is a bidirectional point-to-point test running two ranks on the same host 
(this one actually starts but segfaults halfway through):

NODEFILE is "n50 n50"
[cap@n50 mpi]$ mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1 
--np 2 mpibibench.ompi7132
Using Zero pattern.
starting _bidirect_ lat-bw test.
Latency:   1.8 µsec   (total)  Bandwidth:   0.0 bytes/s (0 x 1)
Latency:   2.0 µsec   (total)  Bandwidth:   1.0 Mbytes/s (1 x 1)
Latency:   2.0 µsec   (total)  Bandwidth:   2.0 Mbytes/s (2 x 1)
Latency:   1.9 µsec   (total)  Bandwidth:   4.2 Mbytes/s (4 x 1)
Latency:   2.0 µsec   (total)  Bandwidth:   8.1 Mbytes/s (8 x 1)
Latency:   2.2 µsec   (total)  Bandwidth:  14.8 Mbytes/s (16 x 1)
Latency:   2.0 µsec   (total)  Bandwidth:  31.7 Mbytes/s (32 x 1)
Latency:   2.2 µsec   (total)  Bandwidth:  57.3 Mbytes/s (64 x 1)
Latency:   2.2 µsec   (total)  Bandwidth: 114.3 Mbytes/s (128 x 1)
Latency:   2.3 µsec   (total)  Bandwidth: 224.8 Mbytes/s (256 x 1)
Latency:   2.8 µsec   (total)  Bandwidth: 369.8 Mbytes/s (512 x 1)
mpirun noticed that job rank 0 with PID 5879 on node "n50" exited on signal 
11.
1 additional process aborted (not shown)

from dmesg:
mpibibench.ompi[5879]: segfault at  rip  rsp 007fbfffe8e8 error 14
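
I haven't attached the benchmark source, but purely to illustrate the 
communication pattern -- this is not the actual mpibibench code, just a 
minimal sketch -- a bidirectional exchange between two ranks looks roughly 
like this:

/* Minimal sketch of a bidirectional exchange between two ranks --
 * illustrative only, this is not the actual mpibibench source. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, peer, i;
    const int len = 512;            /* message size in bytes */
    const int iters = 1000;
    char *sbuf, *rbuf;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0)
            fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    peer = 1 - rank;

    sbuf = malloc(len);
    rbuf = malloc(len);
    memset(sbuf, 0, len);

    for (i = 0; i < iters; i++) {
        /* both ranks send and receive at the same time */
        MPI_Irecv(rbuf, len, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sbuf, len, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    if (rank == 0)
        printf("done\n");

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}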

Running on more than one node seems to die instantly (simple all-to-all app):

NODEFILE is "n50 n50 n49 n49"
[cap@n50 mpi]$ mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1 
--np 4 alltoall.ompi7132
mpirun noticed that job rank 3 with PID 27857 on node "n49" exited on signal 
11.
3 additional processes aborted (not shown)

and with a similar segfault in dmesg.
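
Again only to illustrate the kind of pattern involved (this is not the actual 
test program), a minimal all-to-all over MPI_COMM_WORLD would be along these 
lines:

/* Minimal sketch of an all-to-all exchange -- illustrative only,
 * this is not the actual test program. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int count = 1024;         /* ints sent to each peer */
    int *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc((size_t)size * count * sizeof(int));
    recvbuf = malloc((size_t)size * count * sizeof(int));
    for (i = 0; i < size * count; i++)
        sendbuf[i] = rank;

    /* every rank sends 'count' ints to every other rank */
    MPI_Alltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT,
                 MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}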

Either running with one proc per node or skipping mpi_leave_pinned makes it 
work 100%. Is this expected?

tia,
 Peter

System config:
OS: centos-4.1 x86_64 2.6.9-11smp (el4u1)
ompi: svn7132 vpath build with recommended libtool/autoconf/automake
compilers: 64-bit icc/ifort 8.1-029
configure: ./configure --prefix=xxx --with-btl-mvapi=yyy --disable-cxx 
--disable-f90 --disable-io-romio

-- 

  Peter Kjellström   |
  National Supercomputer Centre  |
  Sweden | http://www.nsc.liu.se




Re: [O-MPI users] open-mpi team - good job and a question

2005-09-02 Thread Jeff Squyres

On Sep 1, 2005, at 9:11 PM, Borenstein, Bernard S wrote:

I tried a recent tarball of open-mpi and it worked well on my 64-bit 
Linux cluster running the NASA Overflow 1.8ab CFD code.  I'm looking 
forward to the 1.0 release of the product.  The only strange thing I 
noticed is that the output file is written in DOS format (with CR and 
LF) on my Unix system.  Did I miss something or is this the way it is 
supposed to work??


I'm not sure that I understand your question -- Open MPI shouldn't 
really have anything to do with the output file of your application 
unless it is written with MPI-2 I/O functions.  Is this a file that 
Overflow itself generates?
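
(By "MPI-2 I/O functions" I mean the MPI_File_* routines.  Just to 
illustrate the distinction, a write through that interface looks roughly 
like the sketch below -- the file name is made up.)

/* Illustration only: what MPI-2 ("MPI-IO") file output looks like.
 * The file name here is a placeholder. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    char line[64];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sprintf(line, "hello from rank %d\n", rank);

    /* each rank writes its line at a fixed offset in a shared file */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(line), line,
                      (int)strlen(line), MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}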


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/



Re: [O-MPI users] Problems running svn7132 with more than one proc per node

2005-09-02 Thread Tim S. Woodall

Thanks Peter,

We'll look into this...

Tim


Peter Kjellström wrote:

But then a small test program failed, and then another. After scratching my 
head for a while, I realised the pattern: as soon as two ranks shared one node 
and I used "mpi_leave_pinned 1", it broke (segfaulted).

[...]

Either running with one proc per node or skipping mpi_leave_pinned makes it 
work 100%. Is this expected?


Re: [O-MPI users] open-mpi team - good job and a question

2005-09-02 Thread Brian Barrett

On Sep 1, 2005, at 8:11 PM, Borenstein, Bernard S wrote:

I tried a recent tarball of open-mpi and it worked well on my 64-bit 
Linux cluster running the NASA Overflow 1.8ab CFD code.  I'm looking 
forward to the 1.0 release of the product.  The only strange thing I 
noticed is that the output file is written in DOS format (with CR and 
LF) on my Unix system.  Did I miss something or is this the way it is 
supposed to work??


I'm not familiar with the NASA Overflow code.  I'm assuming that you  
are talking about output written via stdout/stderr?  If so, we play  
some small games with ptys to do I/O redirection, but I haven't seen  
anything like this before.  Can you provide any more information  
(where the output is going, how the DOS/UNIX format is selected, 
etc.) that might help pin the issue down?
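
For what it's worth, one way to confirm the CR/LF observation independently 
of the application is a small check along the lines of the sketch below (the 
file name is just a placeholder for whatever file Overflow produced):

/* Quick check for DOS-style line endings; "output.dat" is a placeholder. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("output.dat", "rb");  /* binary mode keeps CRs intact */
    int c, prev = 0;
    long crlf = 0, bare_lf = 0;

    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n') {
            if (prev == '\r')
                crlf++;         /* DOS-style CR+LF */
            else
                bare_lf++;      /* Unix-style bare LF */
        }
        prev = c;
    }
    fclose(f);
    printf("CRLF line endings: %ld, bare LF line endings: %ld\n",
           crlf, bare_lf);
    return 0;
}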


Brian