Could you repeat your tests with 1.4.5 and/or 1.5.5?

On Apr 23, 2012, at 1:32 PM, Martin Siegert wrote:

> Hi,
> 
> I am debugging a program that hangs in MPI_Allreduce (openmpi-1.4.3).
> An strace of one of the processes shows:
> 
> Process 10925 attached with 3 threads - interrupt to quit
> [pid 10927] poll([{fd=17, events=POLLIN}, {fd=16, events=POLLIN}], 2, -1 <unfinished ...>
> [pid 10926] select(15, [8 14], [], NULL, NULL <unfinished ...>
> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> ...
> 
> The program is a Fortran program using 64-bit integers (compiled with
> -i8), and I correspondingly compiled Open MPI (version 1.4.3) with -i8
> for the Fortran compiler as well.
> 
> The program is somewhat difficult to debug since it takes 3 days to reach
> the point where it hangs. This is what I found so far:
> 
> MPI_Allreduce is called as
> 
> call MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, MPI_DOUBLE_PRECISION, &
>                   MPI_SUM, MPI_COMM_WORLD, mpierr)
> 
> with count = 455295488. Since the Fortran interface just calls the
> C routines in Open MPI, and count arguments are 32-bit integers in C,
> I started to wonder what the largest "count" is for which an
> MPI_Allreduce succeeds. E.g., in MPICH (it has been a while since I
> looked into this, i.e., this may or may not be correct anymore) all
> send/recv operations were converted into send/recv of MPI_BYTE, thus
> the largest count for doubles was (2^31-1)/8 = 268435455. Thus, I
> started to wrap the MPI_Allreduce call in a myMPI_Allreduce routine
> that splits the reduction into repeated MPI_Allreduce calls whenever
> the count is larger than some value maxallreduce (myMPI_Allreduce.f90
> is attached). I have tested the routine with a trivial program that
> just fills an array with numbers and calls myMPI_Allreduce, and this
> test succeeds.
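> 
> In case the attachment does not make it through the list, the idea is
> roughly the following (just a minimal sketch, written for
> MPI_DOUBLE_PRECISION buffers, with maxallreduce hard-coded here only
> for illustration; the exact chunking logic is in the attached
> myMPI_Allreduce.f90):
> 
> subroutine myMPI_Allreduce(recvbuf, count, datatype, op, comm, mpierr)
>    use mpi
>    implicit none
>    double precision, intent(inout) :: recvbuf(*)
>    integer, intent(in)  :: count, datatype, op, comm
>    integer, intent(out) :: mpierr
>    ! largest count passed to a single MPI_Allreduce call
>    integer, parameter :: maxallreduce = 268435455
>    integer :: nallreduce, length, iallreduce, idx
> 
>    if (count <= maxallreduce) then
>       call MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, datatype, op, &
>                          comm, mpierr)
>       return
>    end if
> 
>    ! split the buffer into nallreduce chunks of (roughly) equal length
>    nallreduce = (count + maxallreduce - 1)/maxallreduce
>    length = count/nallreduce
> 
>    ! reduce the first nallreduce - 1 chunks in place ...
>    do iallreduce = 1, nallreduce - 1
>       idx = (iallreduce - 1)*length + 1
>       call MPI_Allreduce(MPI_IN_PLACE, recvbuf(idx), length, &
>                          datatype, op, comm, mpierr)
>       if (mpierr /= MPI_SUCCESS) return
>    end do
> 
>    ! ... and the last chunk, which also picks up the remainder
>    idx = (nallreduce - 1)*length + 1
>    call MPI_Allreduce(MPI_IN_PLACE, recvbuf(idx), count - idx + 1, &
>                       datatype, op, comm, mpierr)
> end subroutine myMPI_Allreduce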
> However, with the real program the situation is very strange:
> When I set maxallreduce = 268435456, the program hangs at the first call
> (iallreduce = 1) to MPI_Allreduce in the do loop
> 
>         do iallreduce = 1, nallreduce - 1
>            idx = (iallreduce - 1)*length + 1
>            call MPI_Allreduce(MPI_IN_PLACE, recvbuf(idx), length, &
>                               datatype, op, comm, mpierr)
>            if (mpierr /= MPI_SUCCESS) return
>         end do
> 
> With maxallreduce = 134217728 the first call succeeds, the second hangs. 
> For maxallreduce = 67108864, the first two calls to MPI_Allreduce complete, 
> but the third (iallreduce = 3) hangs. For maxallreduce = 8388608 the
> 17th call hangs, for 1048576 the 138th call hangs; here is a table 
> (values from gdb attached to process 0 when the program hangs):
> 
> maxallreduce iallreduce        idx     length
>    268435456          1          1  227647744
>    134217728          2  113823873  113823872
>     67108864          3  130084427   65042213
>      8388608         17  137447697    8590481
>      1048576        138  143392010    1046657
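> 
> (These numbers are consistent with the formula in the loop above: for
> maxallreduce = 134217728, e.g., count = 455295488 is split into chunks
> of length = 455295488/4 = 113823872, so the second call starts at
> idx = 1*113823872 + 1 = 113823873, which is exactly where it hangs.)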
> 
> It is as if there are some element(s) in the middle of the array with
> idx >= 143392010 that cannot be sent or recv'd.
> 
> Has anybody seen this kind of behaviour?
> Does anybody have an idea what could be causing this?
> Any ideas how to get around this?
> Anything that could help would be appreciated ... I have already spent
> a huge amount of time on this and I am running out of ideas.
> 
> Cheers,
> Martin
> 
> -- 
> Martin Siegert
> Simon Fraser University
> Burnaby, British Columbia
> Canada
> <myMPI_Allreduce.f90>


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

