Hi,
I am getting random segmentation violations (signal 11) in MPI_File_close
when testing MPI I/O calls with 2 processes on a single machine. Most of
the time (1499/1500), it works perfectly.
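For context, the I/O pattern the test exercises is essentially the following
minimal sketch (the file name, buffer size and write offsets here are
placeholders for illustration, not the actual test code):

/* Minimal sketch of the I/O pattern; names and sizes are placeholders. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[1024];
    for (int i = 0; i < (int)sizeof(buf); ++i)
        buf[i] = (char)(rank + i);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its own contiguous block. */
    MPI_Offset offset = (MPI_Offset)rank * (MPI_Offset)sizeof(buf);
    MPI_File_write_at(fh, offset, buf, (int)sizeof(buf), MPI_CHAR,
                      MPI_STATUS_IGNORE);

    /* The crash happens inside this call (ROMIO's close.c). */
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}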
Here are the call stacks (with OpenMPI 1.6.3) on both processes:
====================
process 0:
====================
#0 0x00000035374cf287 in sched_yield () from /lib64/libc.so.6
#1 0x00007ff73d158f4f in opal_progress () at runtime/opal_progress.c:220
#2 0x00007ff73d0a6fc5 in opal_condition_wait (count=2,
requests=0x7fffe3ef7ca0, statuses=0x7fffe3ef7c70) at
../opal/threads/condition.h:99
#3 ompi_request_default_wait_all (count=2, requests=0x7fffe3ef7ca0,
statuses=0x7fffe3ef7c70) at request/req_wait.c:263
#4 0x00007ff7348d365e in ompi_coll_tuned_sendrecv_actual (sendbuf=0x0,
scount=0, sdatatype=0x7ff73d3c0cc0, dest=1, stag=-16, recvbuf=<value
optimized out>, rcount=0, rdatatype=0x7ff73d3c0cc0, source=1,
rtag=-16, comm=0x5c21a50, status=0x0) at coll_tuned_util.c:54
#5 0x00007ff7348db8ff in ompi_coll_tuned_barrier_intra_two_procs
(comm=<value optimized out>, module=<value optimized out>) at
coll_tuned_barrier.c:256
#6 0x00007ff73d0b42d2 in PMPI_Barrier (comm=0x5c21a50) at pbarrier.c:70
#7 0x00007ff7302a549c in mca_io_romio_dist_MPI_File_close
(mpi_fh=0x47d9e70) at close.c:62
#8 0x00007ff73d0a15fe in file_destructor (file=0x4d7b270) at
file/file.c:273
#9 0x00007ff73d0a1519 in opal_obj_run_destructors (file=0x7fffe3ef8bb0)
at ../opal/class/opal_object.h:448
#10 ompi_file_close (file=0x7fffe3ef8bb0) at file/file.c:146
#11 0x00007ff73d0ce868 in PMPI_File_close (fh=0x7fffe3ef8bb0) at
pfile_close.c:59
====================
process 1:
====================
...
#9 <signal handler called>
#10 0x00000035374784fd in _int_free () from /lib64/libc.so.6
#11 0x00007f37d777e493 in mca_io_romio_dist_MPI_File_close
(mpi_fh=0x4d41c90) at close.c:55
#12 0x00007f37e457a5fe in file_destructor (file=0x4dbc9b0) at
file/file.c:273
#13 0x00007f37e457a519 in opal_obj_run_destructors (file=0x7fff7c2c94b0)
at ../opal/class/opal_object.h:448
#14 ompi_file_close (file=0x7fff7c2c94b0) at file/file.c:146
#15 0x00007f37e45a7868 in PMPI_File_close (fh=0x7fff7c2c94b0) at
pfile_close.c:59
...
Process 0 is waiting in the barrier at close.c:62, while process 1 crashed
inside the free at close.c:55. The problematic free is:
55 ADIOI_Free((fh)->shared_fp_fname);
Here are the values of the "fh" structure on both processes:
====================
process 0:
====================
{cookie = 2487376, fd_sys = 12, fd_direct = -1, direct_read = 53,
direct_write = 1697919538, d_mem = 3158059, d_miniosz = 1702127872,
fp_ind = 11, fp_sys_posn = -1, fns = 0x7ff7304b2280, comm = 0x5c21a50,
agg_comm = 0x7ff73d3d4120, is_open = 1, is_agg = 1,
filename = 0x4d103a0
"/pmi/cmpbib/compilation_BIB_gcc_redhat_lance_validation/COMPILE_AUTO/TestValidation/Ressources/dev/Test.NormesEtProjectionChamp/Ressources.champscalhermite2dordre5incarete_elemtri_2proc/Resultats.Etal"...,
file_system = 152, access_mode = 2, disp = 0, etype = 0x7ff73d3c0cc0,
filetype = 0x7ff73d3c0cc0, etype_size = 1, hints = 0x4cffde0, info =
0x5377610, split_coll_count = 0, split_status = {
MPI_SOURCE = 1681024372, MPI_TAG = 1919185519, MPI_ERROR =
1852388709, _cancelled = 1701994851, _ucount = 8389473197092726132},
split_datatype = 0x636f7270325f6972,
shared_fp_fname = 0x4d01810 "\330\376x75", shared_fp_fd = 0x0,
async_count = 0, perm = -1, atomicity = 0, fortran_handle = -1,
err_handler = 0x7ff73d3d55c0, fs_ptr = 0x0, file_realm_st_offs = 0x0,
file_realm_types = 0x0, my_cb_nodes_index = 0}
====================
process 1:
====================
print *fh
$4 = {cookie = 2487376, fd_sys = 12, fd_direct = -1, direct_read = 0,
direct_write = 1697919538, d_mem = 3158059, d_miniosz = 1702127872,
fp_ind = 11, fp_sys_posn = -1, fns = 0x7f37d798b280, comm = 0x4db8060,
agg_comm = 0x7f37e48ad120, is_open = 1, is_agg = 0,
filename = 0x4d52b30
"/pmi/cmpbib/compilation_BIB_gcc_redhat_lance_validation/COMPILE_AUTO/TestValidation/Ressources/dev/Test.NormesEtProjectionChamp/Ressources.champscalhermite2dordre5incarete_elemtri_2proc/Resultats.Etal"...,
file_system = 152, access_mode = 2, disp = 0, etype = 0x7f37e4899cc0,
filetype = 0x7f37e4899cc0, etype_size = 1, hints = 0x45c5250, info =
0x4d46750, split_coll_count = 0, split_status = {
MPI_SOURCE = 1681024372, MPI_TAG = 1919185519, MPI_ERROR =
1852388709, _cancelled = 1701994851, _ucount = 168}, split_datatype =
0x7f37e489b0c0,
shared_fp_fname = 0x4806e20
"/pmi/cmpbib/compilation_BIB_gcc_redhat_lance_validation/COMPILE_AUTO/TestValidation/Ressources/dev/Test.NormesEtProjectionChamp/Ressources.champscalhermite2dordre5incarete_elemtri_2proc/Resultats.Etal"...,
shared_fp_fd = 0x0, async_count = 0, perm = -1, atomicity = 0,
fortran_handle = -1, err_handler = 0x7f37e48ae5c0, fs_ptr = 0x0,
file_realm_st_offs = 0x0, file_realm_types = 0x0,
my_cb_nodes_index = -1}
With OpenMPI 1.6.5, the problem also occurs a small number of times.
Here is the error, reported by glibc on process 1:
*** Error in
`/home/mefpp_ericc/GIREF/bin/Test.NormesEtProjectionChamp.dev': free():
invalid next size (normal): 0x000000000471cbc0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7afc6)[0x7f1082edffc6]
/lib64/libc.so.6(+0x7bd43)[0x7f1082ee0d43]
/opt/openmpi-1.6.5/lib64/libmpi.so.1(+0x630a1)[0x7f10847260a1]
/opt/openmpi-1.6.5/lib64/libmpi.so.1(ompi_info_free+0x41)[0x7f10847264f1]
/opt/openmpi-1.6.5/lib64/libmpi.so.1(PMPI_Info_free+0x47)[0x7f108473fd17]
/opt/openmpi-1.6.5/lib64/openmpi/mca_io_romio.so(ADIO_Close+0x186)[0x7f107665f666]
/opt/openmpi-1.6.5/lib64/openmpi/mca_io_romio.so(mca_io_romio_dist_MPI_File_close+0xf3)[0x7f107667fde3]
/opt/openmpi-1.6.5/lib64/libmpi.so.1(+0x60856)[0x7f1084723856]
/opt/openmpi-1.6.5/lib64/libmpi.so.1(ompi_file_close+0x41)[0x7f1084723d71]
/opt/openmpi-1.6.5/lib64/libmpi.so.1(PMPI_File_close+0x78)[0x7f1084750588]
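As far as I understand it, "free(): invalid next size" means the malloc
metadata adjacent to the block being freed was overwritten before the call,
i.e. the heap was already corrupted at that point. Purely for illustration
(this has nothing to do with the real code path), a deliberately broken
snippet such as the following typically produces the same glibc abort:

/* Deliberately broken, for illustration only: writing past the end of one
 * heap block clobbers the malloc metadata of the next one, and glibc
 * detects the corruption when the block is freed.  Undefined behavior;
 * the exact message depends on the glibc version. */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *a = malloc(2000);
    char *b = malloc(2000);

    memset(a, 0x41, 2048);   /* overruns 'a' and corrupts 'b's chunk header */

    free(a);                 /* typically aborts here with
                                "free(): invalid next size (normal)" */
    free(b);
    return 0;
}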
What could be wrong? Is this fixed or changed in newer releases of OpenMPI?
Thanks,
Eric