Hi,
First of all, thanks for your insight!
Do you get a corefile?
I don't get a core file, but I do get a file called _FIL001. It doesn't contain any
debugging symbols; it's most likely a digested version of the input file given to
the executable: ./myexec < inputfile.
there's no line numbers printed in the stack trace
I would love to see those, but even if I compile Open MPI with -debug -mem-debug
-mem-profile, they don't show up. I recompiled my sources to make sure they are
properly linked against the newly built debug version of Open MPI. I assumed I didn't
need to compile my own sources with the -g option, since it crashes inside Open MPI
itself? I didn't try to run mpiexec via gdb either; I guess it won't help since
I already get the trace.
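If it matters, my untested guess is that line numbers need -g on both sides, i.e. on the Open MPI build as well as on my own sources, roughly like this (the source file name below is just a placeholder):
./configure --enable-debug --enable-mem-debug --prefix=/home/toueg/openmpi CFLAGS="-g" FFLAGS="-g" ...
mpif77 -g -c my_module.f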
the -fdefault-integer-8 option ought to be highly dangerous
Thanks for pointing this out. Indeed I have had some issues with this option. For instance, I
have to declare some arguments as INTEGER*4, like RANK, SIZE and IERR in:
CALL MPI_COMM_RANK(MPI_COMM_WORLD,RANK,IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,SIZE,IERR)
In your example "call MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD,
mpierr)" I checked that count never exceeds 2000 (as you mentioned, it could
flip to negative). However, I haven't declared it as INTEGER*4, and I think I should.
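Concretely, what I have in mind is something like the following (just a sketch of the change I intend to try, not tested, and I'm not yet sure it is the right thing for my build):
INTEGER*4 RANK,SIZE,IERR,COUNT
CALL MPI_COMM_RANK(MPI_COMM_WORLD,RANK,IERR)
CALL MPI_SEND(BUF,COUNT,MPI_INTEGER,DEST,TAG,MPI_COMM_WORLD,IERR)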
When I said "I had to raise the number of data structures to be sent", I meant
that I had to call MPI_SEND many more times, not that the buffers were bigger than before.
I'll get back to you with more information once I manage to fix my connection
problem to the cluster...
Thanks,
Benjamin
2010/12/3 Martin Siegert <sieg...@sfu.ca>
Hi All,
just to expand on this guess ...
On Thu, Dec 02, 2010 at 05:40:53PM -0500, Gus Correa wrote:
Hi All
I wonder if configuring OpenMPI while
forcing the default types to non-default values
(-fdefault-integer-8 -fdefault-real-8) might have
something to do with the segmentation fault.
Would this be effective, i.e., actually make the
sizes of MPI_INTEGER/MPI_INT and MPI_REAL/MPI_FLOAT bigger,
or would the effect be illusory?
I believe what happens is that this mostly affects the Fortran
wrapper routines and the way Fortran types are mapped to C:
MPI_INTEGER -> MPI_LONG
MPI_REAL -> MPI_DOUBLE
MPI_DOUBLE_PRECISION -> MPI_DOUBLE
In that respect I believe that the -fdefault-real-8 option is harmless,
i.e., it does the expected thing.
But the -fdefault-integer-8 option ought to be highly dangerous:
it works for integer variables that are used as "buffer" arguments
in MPI calls, but I would assume that it does not work for
"count" and similar arguments.
Example:
integer, allocatable :: buf(:,:)
integer i, i2, count, dest, tag, mpierr
i = 32768
i2 = 2*i
allocate(buf(i,i2))
count = i*i2
buf = 1
call MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, mpierr)
Now count is 2^31, which overflows a 32-bit integer.
The MPI standard requires that count be a 32-bit integer, correct?
Thus, while buf gets the type MPI_LONG, count remains an int.
Is this interpretation correct? If it is, then you are calling
MPI_Send with a count argument of -2147483648,
which could result in a segmentation fault.
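If the 32-bit count is indeed the limit, one workaround (just a sketch, reusing buf, i, i2, dest, tag and mpierr from the example above; the chunk size is arbitrary) is to split the transfer into column blocks so that each individual count stays well below 2^31:
integer j, ncols, chunk_cols
chunk_cols = 1024
do j = 1, i2, chunk_cols
   ncols = min(chunk_cols, i2 - j + 1)
   count = i*ncols
   call MPI_Send(buf(1,j), count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, mpierr)
end do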
Cheers,
Martin
--
Martin Siegert
Head, Research Computing
WestGrid/ComputeCanada Site Lead
IT Services phone: 778 782-4691
Simon Fraser University fax: 778 782-4242
Burnaby, British Columbia email: sieg...@sfu.ca
Canada V5A 1S6
There were some recent discussions here about MPI
limiting counts to MPI_INTEGER.
Since Benjamin said he "had to raise the number of data structures",
which eventually led to the error,
I wonder if he is inadvertently flipping to the negative
side of the 32-bit integer universe (i.e. >= 2**31), as was reported here by
other list subscribers a few times.
Anyway, a segmentation fault can come from many different places;
this is just a guess.
Gus Correa
Jeff Squyres wrote:
Do you get a corefile?
It looks like you're calling MPI_RECV in Fortran and then it segv's. This is
*likely* because you're either passing a bad parameter or your buffer isn't big
enough. Can you double check all your parameters?
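For example, one quick sanity check (a sketch with made-up names: srcrank, msgtag, nmax and rbuf are placeholders) is to probe the incoming message and compare its size to the receive buffer before posting the receive:
integer status(MPI_STATUS_SIZE), ierr, nincoming
call MPI_Probe(srcrank, msgtag, MPI_COMM_WORLD, status, ierr)
call MPI_Get_count(status, MPI_INTEGER, nincoming, ierr)
if (nincoming .gt. nmax) then
   print *, 'incoming message larger than receive buffer:', nincoming, nmax
else
   call MPI_Recv(rbuf, nmax, MPI_INTEGER, srcrank, msgtag, MPI_COMM_WORLD, status, ierr)
end if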
Unfortunately, there's no line numbers printed in the stack trace, so it's not
possible to tell exactly where in the ob1 PML it's dying (i.e., so we can't see
exactly what it's doing to cause the segv).
On Dec 2, 2010, at 9:36 AM, Benjamin Toueg wrote:
Hi,
I am using DRAGON, a neutronics simulation code written in FORTRAN77 that has its own
data structures. I added a module to send these data structures via
MPI_SEND / MPI_RECV, and everything worked perfectly for a while.
Then I had to raise the number of data structures to be sent, up to the point
where my cluster hits this bug:
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x2c2579fc0
[ 0] /lib/libpthread.so.0 [0x7f52d2930410]
[ 1] /home/toueg/openmpi/lib/openmpi/mca_pml_ob1.so [0x7f52d153fe03]
[ 2] /home/toueg/openmpi/lib/libmpi.so.0(PMPI_Recv+0x2d2) [0x7f52d3504a1e]
[ 3] /home/toueg/openmpi/lib/libmpi_f77.so.0(pmpi_recv_+0x10e) [0x7f52d36cf9c6]
How can I make this error more explicit?
I use the following configuration of openmpi-1.4.3:
./configure --enable-debug --prefix=/home/toueg/openmpi CXX=g++ CC=gcc F77=gfortran FC=gfortran
FLAGS="-m64 -fdefault-integer-8 -fdefault-real-8 -fdefault-double-8" FCFLAGS="-m64
-fdefault-integer-8 -fdefault-real-8 -fdefault-double-8" --disable-mpi-f90
Here is the output of mpif77 -v :
mpif77 for 1.2.7 (release) of : 2005/11/04 11:54:51
Driving: f77 -L/usr/lib/mpich-mpd/lib -v -lmpich-p4mpd -lpthread -lrt
-lfrtbegin -lg2c -lm -shared-libgcc
Reading specs from /usr/lib/gcc/x86_64-linux-gnu/3.4.6/specs
Configured with: ../src/configure -v --enable-languages=c,c++,f77,pascal
--prefix=/usr --libexecdir=/usr/lib --with-gxx-include-dir=/usr/include/c++/3.4
--enable-shared --with-system-zlib --enable-nls --without-included-gettext
--program-suffix=-3.4 --enable-__cxa_atexit --enable-clocale=gnu
--enable-libstdcxx-debug x86_64-linux-gnu
Thread model: posix
gcc version 3.4.6 (Debian 3.4.6-5)
/usr/lib/gcc/x86_64-linux-gnu/3.4.6/collect2 --eh-frame-hdr -m elf_x86_64
-dynamic-linker /lib64/ld-linux-x86-64.so.2
/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crt1.o
/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crti.o
/usr/lib/gcc/x86_64-linux-gnu/3.4.6/crtbegin.o -L/usr/lib/mpich-mpd/lib
-L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6
-L/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib
-L/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../.. -L/lib/../lib -L/usr/lib/../lib
-lmpich-p4mpd -lpthread -lrt -lfrtbegin -lg2c -lm -lgcc_s -lgcc -lc -lgcc_s
-lgcc /usr/lib/gcc/x86_64-linux-gnu/3.4.6/crtend.o
/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crtn.o
/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/libfrtbegin.a(frtbegin.o):
in function `main':
(.text+0x1e): undefined reference to `MAIN__'
collect2: ld returned 1 exit status
Thanks,
Benjamin
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users