I am seeing mpi_allreduce calls freeze execution of my code on some moderately sized problems. The freeze does not occur on every problem, and it sits in a portion of the code that is executed many times; in the case discussed below, it appears on the 60th iteration.
The current test case I'm looking at is a 64-processor job. This particular mpi_allreduce call applies to all 64 processors, with each communicator in the call containing a total of 4 processors. When I add print statements before and after the offending line, I see that all 64 processors successfully reach the mpi_allreduce call, but only 32 successfully exit it. Stack traces on the other 32 yield something along the lines of the trace listed at the bottom of this message. The call itself looks like:

  call mpi_allreduce(MPI_IN_PLACE, phim(0:(phim_size-1),1:im,1:jm,1:kmloc(coords(2)+1),grp), &
       phim_size*im*jm*kmloc(coords(2)+1), mpi_real, mpi_sum, ang_com, ierr)

These messages are sized to remain under the 32-bit integer limit on the "count" argument. The intent is to perform the allreduce operation on a contiguous block of the array. Previously, I had been passing an array section (i.e. phim(:,:,:,:,grp)), but found some documentation indicating that was potentially dangerous. Making the change from the array section to an explicitly bounded, contiguous block doesn't solve the problem. However, if I declare an additional array and use separate send and receive buffers:

  call mpi_allreduce(phim_local, phim_global, phim_size*im*jm*kmloc(coords(2)+1), &
       mpi_real, mpi_sum, ang_com, ierr)
  phim(:,:,:,:,grp) = phim_global

then the problem goes away and everything works normally.

Does anyone have any insight as to what may be happening here? I'm using "include 'mpif.h'" rather than the f90 module; does that potentially explain this?

Thanks,
Greg

Stack trace(s) for thread: 1
-----------------
[0] (1 processes)
-----------------
main() at ?:?
solver() at solver.f90:31
solver_q_down() at solver_q_down.f90:52
iter() at iter.f90:56
mcalc() at mcalc.f90:38
pmpi_allreduce__() at ?:?
PMPI_Allreduce() at ?:?
ompi_coll_tuned_allreduce_intra_dec_fixed() at ?:?
ompi_coll_tuned_allreduce_intra_ring_segmented() at ?:?
ompi_coll_tuned_sendrecv_actual() at ?:?
ompi_request_default_wait_all() at ?:?
opal_progress() at ?:?

Stack trace(s) for thread: 2
-----------------
[0] (1 processes)
-----------------
start_thread() at ?:?
btl_openib_async_thread() at ?:?
poll() at ?:?

Stack trace(s) for thread: 3
-----------------
[0] (1 processes)
-----------------
start_thread() at ?:?
service_thread_start() at ?:?
select() at ?:?
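P.S. In case the two variants are hard to follow out of context, here is a stripped-down, self-contained sketch of the pattern. The array size, variable names, and use of MPI_COMM_WORLD are placeholders for illustration only, not my actual code (which uses the sub-communicator ang_com and the phim array described above):

  program allreduce_sketch
    implicit none
    include 'mpif.h'
    ! placeholder count; the real code uses phim_size*im*jm*kmloc(coords(2)+1)
    integer, parameter :: n = 1024
    real, allocatable  :: buf(:), buf_global(:)
    integer :: ierr, rank

    call mpi_init(ierr)
    call mpi_comm_rank(mpi_comm_world, rank, ierr)

    allocate(buf(n), buf_global(n))
    buf = real(rank)

    ! Variant 1: in-place reduction on a single buffer (the form that hangs for me)
    call mpi_allreduce(MPI_IN_PLACE, buf, n, mpi_real, mpi_sum, mpi_comm_world, ierr)

    ! Variant 2: separate send and receive buffers (the form that works)
    call mpi_allreduce(buf, buf_global, n, mpi_real, mpi_sum, mpi_comm_world, ierr)

    call mpi_finalize(ierr)
  end program allreduce_sketch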