Dear MPI users,

Using Valgrind, I found that the possible error (which leads to the segfault or the hang) comes from:


==10334== Conditional jump or move depends on uninitialised value(s)
==10334==    at 0xB150740: btl_openib_handle_incoming (btl_openib_component.c:2888)
==10334==    by 0xB1525A2: handle_wc (btl_openib_component.c:3189)
==10334==    by 0xB150390: btl_openib_component_progress (btl_openib_component.c:3462)
==10334==    by 0x581DDD6: opal_progress (opal_progress.c:207)
==10334==    by 0x52A75DE: ompi_request_default_wait_any (req_wait.c:154)
==10334==    by 0x52ED449: PMPI_Waitany (pwaitany.c:70)
==10334==    by 0x50541BF: MPI_WAITANY (pwaitany_f.c:86)
==10334==    by 0x4ECCC1: mpiwaitany_ (parallelutils.f:1374)
==10334==    by 0x4ECB18: waitanymessages_ (parallelutils.f:1295)
==10334==    by 0x484249: cutman_v_ (grid.f:490)
==10334==    by 0x40DE62: MAIN__ (cosa.f:379)
==10334==    by 0x40BEFB: main (in /work/ady/fsalvado/CAMPOBASSO/CASPUR_MPI/4_MPI/crashtest-valgrind/cosa.mpi)
==10334==
==10334== Use of uninitialised value of size 8
==10334==    at 0xB150764: btl_openib_handle_incoming (btl_openib_component.c:2892)
==10334==    by 0xB1525A2: handle_wc (btl_openib_component.c:3189)
==10334==    by 0xB150390: btl_openib_component_progress (btl_openib_component.c:3462)
==10334==    by 0x581DDD6: opal_progress (opal_progress.c:207)
==10334==    by 0x52A75DE: ompi_request_default_wait_any (req_wait.c:154)
==10334==    by 0x52ED449: PMPI_Waitany (pwaitany.c:70)
==10334==    by 0x50541BF: MPI_WAITANY (pwaitany_f.c:86)
==10334==    by 0x4ECCC1: mpiwaitany_ (parallelutils.f:1374)
==10334==    by 0x4ECB18: waitanymessages_ (parallelutils.f:1295)
==10334==    by 0x484249: cutman_v_ (grid.f:490)
==10334==    by 0x40DE62: MAIN__ (cosa.f:379)
==10334==    by 0x40BEFB: main (in /work/ady/fsalvado/CAMPOBASSO/CASPUR_MPI/4_MPI/crashtest-valgrind/cosa.mpi)

Valgrind complains even without eager_rdma (although the code seems to work in
that case), but it complains much less over TCP/IP. There are many other
Valgrind warnings after these; I can send the complete Valgrind output if
needed.
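
For reference, I launch the code under Valgrind with a command along these
lines (the process count and the suppressions path are placeholders; Open MPI
ships a suppression file for its known false positives under its install
prefix):

mpirun -np 16 valgrind --track-origins=yes \
    --suppressions=<openmpi-prefix>/share/openmpi/openmpi-valgrind.supp \
    ./cosa.mpi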

The messages resemble those in another thread,

http://www.open-mpi.org/community/lists/users/2010/09/14324.php

which, however, ended without a direct solution.

Can anyone help me identify the source of the bug (a bug in my code or in the MPI library)?

Thanks,
Francesco
________________________________
From: Francesco Salvadore <francescosalvad...@yahoo.com>
To: "us...@open-mpi.org" <us...@open-mpi.org>
Sent: Saturday, October 8, 2011 10:06 AM
Subject: [OMPI users] MPI_Waitany segfaults or (maybe) hangs


Dear MPI users, 

I am struggling with the bad behaviour of an MPI code. Here is the basic
information:

a) Intel Fortran 11 or 12 with Open MPI 1.4.1 or 1.4.3 gives the same
problem in all combinations. Activating the -traceback compiler option, I see
that the program stops in MPI_Waitany. MPI_Waitany waits for the completion of
an array of MPI_Irecv requests: the code calls it once per array element, so
after the last iteration all receives should have completed (a minimal sketch
of this pattern is given after the traceback below).
The program stops at unpredictable points (after 1, 5, or 24 hours of
computation). Sometimes I get a segfault:

mca_btl_openib.so  00002BA74D29D181  Unknown               Unknown  Unknown 
mca_btl_openib.so  00002BA74D29C6FF  Unknown               Unknown  Unknown 
mca_btl_openib.so  00002BA74D29C033  Unknown               Unknown  Unknown 
libopen-pal.so.0   00002BA74835C3E6  Unknown               Unknown  Unknown 
libmpi.so.0        00002BA747E485AD  Unknown               Unknown  Unknown 
libmpi.so.0        00002BA747E7857D  Unknown               Unknown  Unknown 
libmpi_f77.so.0    00002BA747C047C4  Unknown               Unknown  Unknown 
cosa.mpi           00000000004F856B  waitanymessages_         1292  parallelutils.f 
cosa.mpi           00000000004C8044  cutman_q_                2084  bc.f 
cosa.mpi           0000000000413369  smooth_                  2029  cosa.f 
cosa.mpi           0000000000410782  mg_                       810  cosa.f 
cosa.mpi           000000000040FB78  MAIN__                    537  cosa.f 
cosa.mpi           000000000040C1FC  Unknown               Unknown  Unknown 
libc.so.6          00002BA7490AE994  Unknown               Unknown  Unknown 
cosa.mpi           000000000040C109  Unknown               Unknown  Unknown 
-------------------------------------------------------------------------- 
mpirun has exited due to process rank 34 with PID 10335 on 
node neo251 exiting without calling "finalize". This may 
have caused other processes in the application to be 
terminated by signals sent by mpirun (as reported here). 
-------------------------------------------------------------------------- 

waitanymessages is just a wrapper around MPI_Waitany. Sometimes the run
stops writing anything to the screen and I cannot tell what is happening
(probably MPI_Waitany hangs). Up to the segfault or hang, the results are
always correct, as checked against the serial version of the code.
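
To make the pattern concrete, here is a minimal self-contained sketch. All
names (nrecv, rbuf, sbuf, ...) are made up for illustration, not taken from
parallelutils.f, and each rank sends to itself only so that the example runs
stand-alone; the real code receives from neighbouring ranks:

      program waitany_sketch
c     minimal sketch of the irecv + waitany pattern described above;
c     all names are illustrative, not from the real code
      implicit none
      include 'mpif.h'
      integer nrecv
      parameter (nrecv = 4)
      integer reqs(nrecv), sreqs(nrecv), status(MPI_STATUS_SIZE)
      integer rbuf(nrecv), sbuf(nrecv), i, idx, rank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

c     post all non-blocking receives up front
      do i = 1, nrecv
         call MPI_IRECV(rbuf(i), 1, MPI_INTEGER, rank, i,
     &                  MPI_COMM_WORLD, reqs(i), ierr)
      end do

c     matching sends (self-sends, only to keep the sketch runnable)
      do i = 1, nrecv
         sbuf(i) = i
         call MPI_ISEND(sbuf(i), 1, MPI_INTEGER, rank, i,
     &                  MPI_COMM_WORLD, sreqs(i), ierr)
      end do

c     one MPI_WAITANY per request: each call completes one receive,
c     in arbitrary order, so after nrecv iterations all receives
c     have completed
      do i = 1, nrecv
         call MPI_WAITANY(nrecv, reqs, idx, status, ierr)
c        the real code processes message idx here
      end do

      call MPI_WAITALL(nrecv, sreqs, MPI_STATUSES_IGNORE, ierr)
      call MPI_FINALIZE(ierr)
      end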

b) The problem occurs only with openib (over TCP/IP it works) and only
when using more than one node on our main cluster. Trying many possible
workarounds, I found that when running with:

-mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0 -mca
btl_openib_flags 1

the problem seems not to occur.
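
That is, a full launch line along these lines (the process count is a
placeholder):

mpirun -np 16 -mca btl_openib_use_eager_rdma 0 \
       -mca btl_openib_max_eager_rdma 0 \
       -mca btl_openib_flags 1 ./cosa.mpi

If I read the flags correctly, btl_openib_flags 1 restricts the openib BTL
to send/receive and disables RDMA put/get, which would point at the RDMA
path.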

I would be very grateful to anyone who can help me make sure there is no
bug in the code and, in any case, discover the reason for such "dangerous"
behaviour.

I can provide any further information if needed, and I apologize if this
post is not sufficiently clear or complete.

regards, 
Francesco 