Many thanks for all this information. Unfortunately, it's still not
enough to tell what's going on. :-(
Do you know for sure that the application is correct? E.g., is it
possible that a bad buffer is being passed to MPI_Isend? I note that
it is fairly odd to fail in MPI_Isend itself because that function is
actually pretty short -- it mainly checks parameters and then calls a
back-end Open MPI function to actually do the send.
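As a quick check, you could sanity-test the arguments in your own code before they ever reach MPI. The sketch below is only illustrative -- the wrapper name checked_isend and the specific asserts are my assumptions, not anything taken from your application or from Open MPI internals:

/* Hypothetical sketch: a thin wrapper that sanity-checks the arguments
 * handed to MPI_Isend before making the real call. */
#include <assert.h>
#include <mpi.h>

static int checked_isend(void *buf, int count, MPI_Datatype type,
                         int dest, int tag, MPI_Comm comm, MPI_Request *req)
{
    int size;
    MPI_Comm_size(comm, &size);

    assert(buf != NULL);              /* a freed or wild pointer often shows up here   */
    assert(count >= 0);               /* negative counts usually mean a bookkeeping bug */
    assert(dest >= 0 && dest < size); /* destination rank must be valid in this comm    */
    assert(req != NULL);

    /* If all of the above pass, a crash is more likely caused by a buffer
       that points to freed or overrun memory than by an obviously bad argument. */
    return MPI_Isend(buf, count, type, dest, tag, comm, req);
}

If those checks all pass and the crash persists, the buffer may still point to freed or overrun memory, which a tool like valgrind can often catch.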
Do you get corefiles with the killed processes, and can you analyze
where the application failed? If so, can you verify that all state in
the application appears to be correct? It might be helpful to analyze
exactly where the application failed (e.g., compile at least
ompi/mpi/c/isend.c with the -g flag so that you can get some debugging
information about exactly where in MPI_Isend it failed -- like I said,
it's a short function that mainly checks parameters). You might want
to have your application double check all the parameters that are
passed to MPI_Isend, too.
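One possible workflow, assuming a csh/tcsh shell (you use setenv below), that "..." stands for the rest of your original configure options (including the openib path), and that the corefile ends up named with the PID -- the exact name depends on your system's core pattern:

setenv CFLAGS '-g -O'
./configure --prefix=/home/ipl/openmpi-1.3.3/platforms/hp ...
make all install

limit coredumpsize unlimited
nohup mpirun -H hydra11,hydra12 -np 8 ./flow caseC.in &

gdb ./flow core.9312
(gdb) bt full

The "bt full" output should show exactly which line in isend.c was reached and what the argument values were at the time of the crash.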
On Oct 26, 2009, at 9:43 AM, Iris Pernille Lohmann wrote:
Dear list members
I am using Open MPI 1.3.3 with OFED on an HP cluster with Red Hat Linux.
Occasionally (not always) I get a crash with the following message:
[hydra11:09312] *** Process received signal ***
[hydra11:09312] Signal: Segmentation fault (11)
[hydra11:09312] Signal code: Address not mapped (1)
[hydra11:09312] Failing at address: 0xffffffffab5f30a8
[hydra11:09312] [ 0] /lib64/libpthread.so.0 [0x3c1400e4c0]
[hydra11:09312] [ 1] /home/ipl/openmpi-1.3.3/platforms/hp/lib/libmpi.so.0(MPI_Isend+0x93) [0x2af1be45a3e3]
[hydra11:09312] [ 2] ./flow(MP_SendReal+0x60) [0x6bc993]
[hydra11:09312] [ 3] ./flow(SendRealsAlongFaceWithOffset_3D+0x4ab) [0x68ba19]
[hydra11:09312] [ 4] ./flow(MP_SendVertexArrayBlock+0x23d) [0x6891e1]
[hydra11:09312] [ 5] ./flow(MB_CommAllVertex+0x65) [0x6848ba]
[hydra11:09312] [ 6] ./flow(MB_SetupVertexArray+0xd5) [0x68c837]
[hydra11:09312] [ 7] ./flow(MB_SetupGrid+0xa8) [0x68be51]
[hydra11:09312] [ 8] ./flow(SetGrid+0x58) [0x446224]
[hydra11:09312] [ 9] ./flow(main+0x148) [0x43b728]
[hydra11:09312] [10] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c1341d974]
[hydra11:09312] [11] ./flow(__gxx_personality_v0+0xd9) [0x429b19]
[hydra11:09312] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 9312 on node hydra11
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The crash does not always appear -- sometimes the application runs
fine. However, it seems to occur especially when I run on more than
1 node.
I have consulted the Open MPI archives and found many error messages
of the same kind, but none from version 1.3.3 and none of direct
relevance.
I would really appreciate comments on this. Below is the information
requested on the Open MPI web site:
Config.log: attached (config.zip)
Open MPI was configured with a prefix and with the path to openib, and
with the following compiler flags:
setenv CC gcc
setenv CFLAGS '-O'
setenv CXX g++
setenv CXXFLAGS '-O'
setenv F77 'gfortran'
setenv FFLAGS '-O'
ompi_info --all:
attached
The application (named flow) was launched on hydra11 by
nohup mpirun -H hydra11,hydra12 -np 8 ./flow caseC.in &
The PATH and LD_LIBRARY_PATH on hydra11 and hydra12:
PATH=/home/ipl/openmpi-1.3.3/platforms/hp/bin
LD_LIBRARY_PATH=/home/ipl/openmpi-1.3.3/platforms/hp/lib
OpenFabrics version: 1.4
Linux:
X86_64-redhat-linux/3.4.6
ibv_devinfo, hydra11: attached
ibv_devinfo, hydra12: attached
ifconfig, hydra11: attached
ifconfig, hydra12: attached
ulimit -l (hydra11): 6000000
ulimit -l (hydra12): unlimited
Furthermore, I can say that I have not specified any MCA parameters.
The application I am running (named flow) is linked from Fortran, C,
and C++ libraries with the following:
/home/ipl/openmpi-1.3.3/platforms/hp/bin/mpicc -DMP -DNS3_ARCH_LINUX -DLAPACK -I/home/ipl/ns3/engine/include_forLinux -I/home/ipl/openmpi-1.3.3/platforms/hp/include -c -o user_small_3D.o user_small_3D.c
rm -f flow
/home/ipl/openmpi-1.3.3/platforms/hp/bin/mpicxx -o flow user_small_3D.o -L/home/ipl/ns3/engine/lib_forLinux -lns3main -lns3pars -lns3util -lns3vofl -lns3turb -lns3solv -lns3mesh -lns3diff -lns3grid -lns3line -lns3data -lns3base -lfitpack -lillusolve -lfftpack_small -lfenton -lns3air -lns3dens -lns3poro -lns3sedi -llapack_small -lblas_small -lm -lgfortran /home/ipl/ns3/engine/lib_Tecplot_forLinux/tecio64.a
Please let me know if you need more info!
Thanks in advance,
Iris Lohmann
Iris Pernille Lohmann
MSc, PhD
Ports & Offshore Technology (POT)
DHI
Agern Allé 5
DK-2970 Hørsholm
Denmark
Tel: +45 4516 9200
Direct: 45169427
i...@dhigroup.com
www.dhigroup.com
WATER • ENVIRONMENT • HEALTH
<config.zip> <ompi_info_all.zip> <ibv_devinfo_hydra11.out> <ibv_devinfo_hydra12.out> <ifconfig_hydra11.out> <ifconfig_hydra12.out>
--
Jeff Squyres
jsquy...@cisco.com