Could you repeat your tests with 1.4.5 and/or 1.5.5?
On Apr 23, 2012, at 1:32 PM, Martin Siegert wrote:

> Hi,
>
> I am debugging a program that hangs in MPI_Allreduce (openmpi-1.4.3).
> An strace of one of the processes shows:
>
> Process 10925 attached with 3 threads - interrupt to quit
> [pid 10927] poll([{fd=17, events=POLLIN}, {fd=16, events=POLLIN}], 2, -1 <unfinished ...>
> [pid 10926] select(15, [8 14], [], NULL, NULL <unfinished ...>
> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> ...
>
> The program is a Fortran program using 64-bit integers (compiled with -i8),
> and I correspondingly compiled openmpi (version 1.4.3) with -i8 for
> the Fortran compiler as well.
>
> The program is somewhat difficult to debug since it takes 3 days to reach
> the point where it hangs. This is what I have found so far:
>
> MPI_Allreduce is called as
>
>    call MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, MPI_DOUBLE_PRECISION, &
>                       MPI_SUM, MPI_COMM_WORLD, mpierr)
>
> with count = 455295488. Since the Fortran interface just calls the
> C routines in Open MPI, and count variables are 32-bit integers in C,
> I started to wonder what the largest "count" is for which an
> MPI_Allreduce succeeds. E.g., in MPICH (it has been a while since I
> looked into this, i.e., this may or may not be correct anymore) all
> send/recv operations were converted into send/recv of MPI_BYTE, thus
> the largest count for doubles was (2^31-1)/8 = 268435455. Thus, I
> started to wrap the MPI_Allreduce call with a myMPI_Allreduce routine
> that repeatedly calls MPI_Allreduce when the count is larger than some
> value maxallreduce (the myMPI_Allreduce.f90 is attached). I have tested
> the routine with a trivial program that just fills an array with
> numbers and calls myMPI_Allreduce, and this test succeeds.
> However, with the real program the situation is very strange:
> When I set maxallreduce = 268435456, the program hangs at the first call
> (iallreduce = 1) to MPI_Allreduce in the do loop
>
>    do iallreduce = 1, nallreduce - 1
>       idx = (iallreduce - 1)*length + 1
>       call MPI_Allreduce(MPI_IN_PLACE, recvbuf(idx), length, &
>                          datatype, op, comm, mpierr)
>       if (mpierr /= MPI_SUCCESS) return
>    end do
>
> With maxallreduce = 134217728 the first call succeeds, the second hangs.
> For maxallreduce = 67108864, the first two calls to MPI_Allreduce complete,
> but the third (iallreduce = 3) hangs. For maxallreduce = 8388608 the
> 17th call hangs; for 1048576 the 138th call hangs. Here is a table
> (values from gdb attached to process 0 when the program hangs):
>
>    maxallreduce   iallreduce         idx      length
>       268435456            1           1   227647744
>       134217728            2   113823873   113823872
>        67108864            3   130084427    65042213
>         8388608           17   137447697     8590481
>         1048576          138   143392010     1046657
>
> It is as if there are some elements in the middle of the array, at
> idx >= 143392010, that cannot be sent or received.
>
> Has anybody seen this kind of behaviour?
> Does anybody have an idea what could be causing this?
> Any ideas how to get around this?
> Anything that could help would be appreciated ... I have already spent a
> huge amount of time on this and I am running out of ideas.
>
> Cheers,
> Martin
>
> --
> Martin Siegert
> Simon Fraser University
> Burnaby, British Columbia
> Canada
> <myMPI_Allreduce.f90>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
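
For reference, here is a minimal sketch of the kind of chunking wrapper Martin
describes. The actual myMPI_Allreduce.f90 attachment is not reproduced here;
the subroutine's argument list, the default maxallreduce value, and the
while-loop structure below are assumptions for illustration, not Martin's code.
It assumes an -i8 build, where default Fortran integers are 64 bit and match
what the -i8-built MPI library expects.

    ! Sketch: split one large allreduce into chunks of at most
    ! maxallreduce elements, so each individual call's count stays
    ! below the 32-bit limit of the C-side "int count" argument.
    subroutine myMPI_Allreduce(recvbuf, count, datatype, op, comm, mpierr)
       use mpi
       implicit none
       double precision, intent(inout) :: recvbuf(*)
       integer, intent(in)  :: count, datatype, op, comm  ! 64-bit under -i8
       integer, intent(out) :: mpierr
       integer, parameter :: maxallreduce = 268435455     ! (2^31-1)/8 doubles
       integer :: idx, remaining, length

       idx = 1
       remaining = count
       do while (remaining > 0)
          length = min(remaining, maxallreduce)
          call MPI_Allreduce(MPI_IN_PLACE, recvbuf(idx), length, &
                             datatype, op, comm, mpierr)
          if (mpierr /= MPI_SUCCESS) return
          idx = idx + length
          remaining = remaining - length
       end do
    end subroutine myMPI_Allreduce

Note that MPI collectives must match across ranks: every rank has to pass
MPI_IN_PLACE and use the same count (and hence the same chunk boundaries) on
every iteration, otherwise the calls mismatch and the program hangs.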