Many thanks for all this information. Unfortunately, it's not enough to tell what's going on. :-(

Do you know for sure that the application is correct? E.g., is it possible that a bad buffer is being passed to MPI_Isend? I note that it is fairly odd to fail in MPI_Isend itself because that function is actually pretty short -- it mainly checks parameters and then calls a back-end Open MPI function to actually do the send.
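
For reference, here is a tiny self-contained sketch (not taken from your flow code; the buffer, count, and tag are made up) of what a well-formed MPI_Isend looks like. The key rule is that the send buffer must stay allocated and unmodified until the request completes:

/* Minimal sketch, assuming a hypothetical 2-rank job: the send buffer
 * must remain valid and untouched until MPI_Wait completes the request. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double payload[16] = {0};    /* must outlive the pending send */
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(payload, 16, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD, &req);
        /* ... other work here, but do not free or overwrite payload ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        double recvbuf[16];
        MPI_Recv(recvbuf, 16, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

If something above MPI_Isend on that stack passes a stale or freed pointer, you can get exactly this kind of intermittent segfault.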

Do you get corefiles with the killed processes, and can you analyze where the application failed? If so, can you verify that all state in the application appears to be correct? It might be helpful to analyze exactly where the application failed (e.g., compile at least ompi/mpi/c/isend.c with the -g flag so that you can get some debugging information about exactly where in MPI_Isend it failed -- like I said, it's a short function that mainly checks parameters). You might want to have your application double-check all the parameters that are passed to MPI_Isend, too.
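
If you want to add those checks in your own code, something like the following might help. This checked_isend() wrapper is purely hypothetical -- it is not part of MPI or of your application -- and just illustrates the kind of sanity checks I mean (note that it deliberately rejects MPI_PROC_NULL as a destination):

/* Hypothetical wrapper (checked_isend is not an MPI function) that
 * sanity-checks arguments before handing them to MPI_Isend. */
#include <mpi.h>
#include <stdio.h>

static int checked_isend(void *buf, int count, MPI_Datatype dtype,
                         int dest, int tag, MPI_Comm comm, MPI_Request *req)
{
    int size;
    MPI_Comm_size(comm, &size);

    if (count > 0 && buf == NULL) {
        fprintf(stderr, "checked_isend: NULL buffer with count=%d\n", count);
        MPI_Abort(comm, 1);
    }
    /* Valid tags are 0..MPI_TAG_UB; dest must be a rank in comm
     * (MPI_PROC_NULL is not handled in this sketch). */
    if (count < 0 || tag < 0 || dest < 0 || dest >= size || req == NULL) {
        fprintf(stderr, "checked_isend: bad count/tag/dest = %d/%d/%d\n",
                count, tag, dest);
        MPI_Abort(comm, 1);
    }
    return MPI_Isend(buf, count, dtype, dest, tag, comm, req);
}

While you're hunting this down, MP_SendReal could call checked_isend() instead of MPI_Isend() so that an obviously bad argument aborts with a clear message rather than a segfault.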


On Oct 26, 2009, at 9:43 AM, Iris Pernille Lohmann wrote:

Dear list members

I am using Open MPI 1.3.3 with OFED on an HP cluster running Red Hat Linux.

Occasionally (not always) I get a crash with the following message:

[hydra11:09312] *** Process received signal ***
[hydra11:09312] Signal: Segmentation fault (11)
[hydra11:09312] Signal code: Address not mapped (1)
[hydra11:09312] Failing at address: 0xffffffffab5f30a8
[hydra11:09312] [ 0] /lib64/libpthread.so.0 [0x3c1400e4c0]
[hydra11:09312] [ 1] /home/ipl/openmpi-1.3.3/platforms/hp/lib/libmpi.so.0(MPI_Isend+0x93) [0x2af1be45a3e3]
[hydra11:09312] [ 2] ./flow(MP_SendReal+0x60) [0x6bc993]
[hydra11:09312] [ 3] ./flow(SendRealsAlongFaceWithOffset_3D+0x4ab) [0x68ba19]
[hydra11:09312] [ 4] ./flow(MP_SendVertexArrayBlock+0x23d) [0x6891e1]
[hydra11:09312] [ 5] ./flow(MB_CommAllVertex+0x65) [0x6848ba]
[hydra11:09312] [ 6] ./flow(MB_SetupVertexArray+0xd5) [0x68c837]
[hydra11:09312] [ 7] ./flow(MB_SetupGrid+0xa8) [0x68be51]
[hydra11:09312] [ 8] ./flow(SetGrid+0x58) [0x446224]
[hydra11:09312] [ 9] ./flow(main+0x148) [0x43b728]
[hydra11:09312] [10] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c1341d974]
[hydra11:09312] [11] ./flow(__gxx_personality_v0+0xd9) [0x429b19]
[hydra11:09312] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 9312 on node hydra11 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The crash does not always occur; sometimes the application runs fine. However, it seems to occur especially when I run on more than one node.

I have consulted the Open MPI list archives and found many error messages of the same kind, but none from version 1.3.3 and none of direct relevance.

I would really appreciate comments on this. Below is the information requested on the Open MPI website:

config.log: attached (config.zip)
Open MPI was configured with a prefix and the path to OpenIB, and with the following compiler flags:
setenv CC gcc
setenv CFLAGS '-O'
setenv CXX g++
setenv CXXFLAGS '-O'
setenv F77 'gfortran'
setenv FFLAGS '-O'

ompi_info --all: attached

The application (named flow) was launched on hydra11 by
nohup mpirun -H hydra11,hydra12 -np 8 ./flow caseC.in &

The PATH and LD_LIBRARY_PATH on hydra11 and hydra12:
PATH=/home/ipl/openmpi-1.3.3/platforms/hp/bin
LD_LIBRARY_PATH=/home/ipl/openmpi-1.3.3/platforms/hp/lib

OpenFabrics version: 1.4

Linux:
x86_64-redhat-linux/3.4.6

ibv_devinfo, hydra11: attached
ibv_devinfo, hydra12: attached

ifconfig, hydra11: attached
ifconfig, hydra12: attached

ulimit -l (hydra11): 6000000
ulimit -l (hydra12): unlimited

Furthermore, I can say that I have not specified any MCA parameters.

The application I am running (named flow) is linked from Fortran, C, and C++ libraries as follows:

/home/ipl/openmpi-1.3.3/platforms/hp/bin/mpicc -DMP -DNS3_ARCH_LINUX -DLAPACK -I/home/ipl/ns3/engine/include_forLinux -I/home/ipl/openmpi-1.3.3/platforms/hp/include -c -o user_small_3D.o user_small_3D.c
rm -f flow
/home/ipl/openmpi-1.3.3/platforms/hp/bin/mpicxx -o flow user_small_3D.o -L/home/ipl/ns3/engine/lib_forLinux -lns3main -lns3pars -lns3util -lns3vofl -lns3turb -lns3solv -lns3mesh -lns3diff -lns3grid -lns3line -lns3data -lns3base -lfitpack -lillusolve -lfftpack_small -lfenton -lns3air -lns3dens -lns3poro -lns3sedi -llapack_small -lblas_small -lm -lgfortran /home/ipl/ns3/engine/lib_Tecplot_forLinux/tecio64.a

Please let me know if you need more info!

Thanks in advance,
Iris Lohmann




Iris Pernille Lohmann
MSc, PhD
Ports & Offshore Technology (POT)


DHI
Agern Allé 5
DK-2970 Hørsholm
Denmark

Tel: +45 4516 9200
Direct: 45169427

i...@dhigroup.com
www.dhigroup.com

WATER  •  ENVIRONMENT  •  HEALTH



<config.zip> <ompi_info_all.zip> <ibv_devinfo_hydra11.out> <ibv_devinfo_hydra12.out> <ifconfig_hydra11.out> <ifconfig_hydra12.out>


--
Jeff Squyres
jsquy...@cisco.com

