We have a couple of OpenMPI 1.3.2 installations here, and we are having
real problems with single-node jobs randomly hanging when using the
shared-memory (sm) BTL, particularly (but perhaps not only) with the
build compiled with gcc 4.4.0.
The very trivial attached program, which just does a series of
MPI_SENDRECVs rightwards around the ring of ranks in MPI_COMM_WORLD,
hangs extremely reliably when run like so on an 8-core box:
mpirun -np 6 -mca btl self,sm ./diffusion-mpi
(the test case was based on a simple Fortran example of MPI-parallelizing
the 1d diffusion equation). The hang always seems to occur within the
first 500 or so iterations - sometimes between the 10th and 20th,
sometimes not until the late 400s. It occurs both on a new dual-socket
quad-core Nehalem box and on an older Harpertown machine.
Running without sm, however, seems to work fine:
mpirun -np 6 -mca btl self,tcp ./diffusion-mpi
never gives any problems.
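In the meantime we're steering jobs away from sm by hand; for anyone
else hitting this, the same workaround can be applied per-user through
an MCA parameter file instead of per-mpirun flags (the path below is
the per-user default location, if I have it right):

    # ~/.openmpi/mca-params.conf -- per-user Open MPI MCA defaults
    # Skip the shared-memory BTL until the hang is diagnosed;
    # equivalent to passing "-mca btl self,tcp" to every mpirun.
    btl = self,tcp
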
Any suggestions? I notice a mention of "improved flow control in sm" in
the ChangeLog for 1.3.3; is that a significant bug fix?
- Jonathan
--
Jonathan Dursi <ljdu...@scinet.utoronto.ca>
program diffuse
  implicit none
  include "mpif.h"

  integer, parameter :: nsteps = 150000
  integer :: step
  real :: a, b
  integer :: ierr
  integer :: mpistatus(MPI_STATUS_SIZE)
  integer :: nprocs, rank
  integer :: leftneighbour, rightneighbour
  integer :: tag

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

! Periodic (ring) neighbours
  leftneighbour = rank - 1
  if (leftneighbour .eq. -1) leftneighbour = nprocs - 1
  rightneighbour = rank + 1
  if (rightneighbour .eq. nprocs) rightneighbour = 0

  a = real(rank)   ! give the send buffer a defined value
  tag = 1
  do step = 1, nsteps
!   Send one real to the right, receive one from the left
    call MPI_SENDRECV(a, 1, MPI_REAL, rightneighbour, tag, &
                      b, 1, MPI_REAL, leftneighbour, tag,  &
                      MPI_COMM_WORLD, mpistatus, ierr)
    if ((rank .eq. 0) .and. (mod(step,10) .eq. 1)) then
       print *, 'Step = ', step
    endif
  enddo

  call MPI_FINALIZE(ierr)
end program diffuse