Hello,

I think that my problem,
http://www.open-mpi.org/community/lists/users/2012/05/19182.php
is similar to yours. Following the advice in the thread that you posted,
http://www.open-mpi.org/community/lists/users/2011/07/16996.php
I tried running my program with -mca btl_openib_flags 305 added to the
mpirun command and... it passes the point where it hung all the other
times :D !!!!
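For reference, the full command line looks roughly like this (the process
count and program name are placeholders, not the ones I actually use):

    mpirun -np <nprocs> -mca btl_openib_flags 305 ./my_program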
Now I will try to understand why this is happening and what those flags
mean, but first I wanted to share this with you, just in case it helps
you too.

Good luck, and thank you and Brock for your help,

Jorge

On Thu, 2012-05-03 at 23:01 -0700, Martin Siegert wrote:
> On Tue, Apr 24, 2012 at 04:19:31PM -0400, Brock Palen wrote:
> > To throw in my $0.02, though it is worth less.
> >
> > Were you running this on verbs-based InfiniBand?
>
> Correct.
>
> > We see a problem that we have a workaround for, even with the newest 1.4.5,
> > but only on IB; we can reproduce it with IMB.
>
> I can now confirm that the program hangs with 1.4.5 as well, at exactly
> the same point.
> Any chance that this has to do with the default settings for the
> btl_openib_max_eager_rdma and mpi_leave_pinned mca parameters? I.e.,
> should I try to run the program with
> --mca btl_openib_max_eager_rdma 0 --mca mpi_leave_pinned 0
>
> > You can find an old thread from me about it. Your problem might not be the
> > same.
> >
> > Brock Palen
> > www.umich.edu/~brockp
> > CAEN Advanced Computing
> > bro...@umich.edu
> > (734)936-1985
>
> This one?
> http://www.open-mpi.org/community/lists/users/2011/07/16996.php
>
> - Martin
>
> > On Apr 24, 2012, at 3:09 PM, Jeffrey Squyres wrote:
> >
> > > Could you repeat your tests with 1.4.5 and/or 1.5.5?
> > >
> > > On Apr 23, 2012, at 1:32 PM, Martin Siegert wrote:
> > >
> > >> Hi,
> > >>
> > >> I am debugging a program that hangs in MPI_Allreduce (openmpi-1.4.3).
> > >> An strace of one of the processes shows:
> > >>
> > >> Process 10925 attached with 3 threads - interrupt to quit
> > >> [pid 10927] poll([{fd=17, events=POLLIN}, {fd=16, events=POLLIN}], 2, -1 <unfinished ...>
> > >> [pid 10926] select(15, [8 14], [], NULL, NULL <unfinished ...>
> > >> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> > >> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> > >> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> > >> [pid 10925] poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}], 5, 0) = 0 (Timeout)
> > >> ...
> > >>
> > >> The program is a Fortran program using 64-bit integers (compiled with -i8),
> > >> and I correspondingly compiled Open MPI (version 1.4.3) with -i8 for
> > >> the Fortran compiler as well.
> > >>
> > >> The program is somewhat difficult to debug since it takes 3 days to reach
> > >> the point where it hangs. This is what I have found so far:
> > >>
> > >> MPI_Allreduce is called as
> > >>
> > >>   call MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, MPI_DOUBLE_PRECISION, &
> > >>                      MPI_SUM, MPI_COMM_WORLD, mpierr)
> > >>
> > >> with count = 455295488. Since the Fortran interface just calls the
> > >> C routines in Open MPI, and count variables are 32-bit integers in C,
> > >> I started to wonder what the largest "count" is for which an
> > >> MPI_Allreduce succeeds.
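As a quick sanity check on the sizes involved here (this little program is
not from the thread; it just redoes the arithmetic with the count quoted
above):

    program count_check
      implicit none
      integer, parameter :: i8 = selected_int_kind(18)       ! 64-bit integer kind
      integer(i8), parameter :: count     = 455295488_i8     ! doubles in the Allreduce above
      integer(i8), parameter :: int32_max = 2147483647_i8    ! 2**31 - 1

      ! The count itself still fits in a signed 32-bit int ...
      print *, 'count            <= 2**31-1 ?', count <= int32_max          ! T
      ! ... but the payload in bytes does not (455295488 * 8 = 3642363904),
      ! so any internal conversion to a signed 32-bit byte count overflows.
      print *, 'count*8 (bytes)  <= 2**31-1 ?', 8_i8*count <= int32_max     ! F
      ! Largest double-precision count that survives such a conversion:
      print *, 'max doubles via byte count   :', int32_max/8_i8             ! 268435455
    end program count_check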
> > >> E.g., in MPICH (it has been a while since I looked into this, i.e.,
> > >> this may or may not be correct anymore) all send/recv operations were
> > >> converted into send/recv of MPI_BYTE, thus the largest count for doubles
> > >> was (2^31-1)/8 = 268435455. Thus, I started to wrap the MPI_Allreduce
> > >> call with a myMPI_Allreduce routine that repeatedly calls MPI_Allreduce
> > >> when the count is larger than some value maxallreduce (the
> > >> myMPI_Allreduce.f90 is attached). I have tested the routine with a
> > >> trivial program that just fills an array with numbers and calls
> > >> myMPI_Allreduce, and this test succeeds.
> > >> However, with the real program the situation is very strange:
> > >> when I set maxallreduce = 268435456, the program hangs at the first call
> > >> (iallreduce = 1) to MPI_Allreduce in the do loop
> > >>
> > >>   do iallreduce = 1, nallreduce - 1
> > >>      idx = (iallreduce - 1)*length + 1
> > >>      call MPI_Allreduce(MPI_IN_PLACE, recvbuf(idx), length, &
> > >>                         datatype, op, comm, mpierr)
> > >>      if (mpierr /= MPI_SUCCESS) return
> > >>   end do
> > >>
> > >> With maxallreduce = 134217728 the first call succeeds, the second hangs.
> > >> For maxallreduce = 67108864, the first two calls to MPI_Allreduce
> > >> complete, but the third (iallreduce = 3) hangs. For maxallreduce = 8388608
> > >> the 17th call hangs, and for 1048576 the 138th call hangs. Here is a table
> > >> (values from gdb attached to process 0 when the program hangs):
> > >>
> > >>   maxallreduce   iallreduce         idx      length
> > >>      268435456            1           1   227647744
> > >>      134217728            2   113823873   113823872
> > >>       67108864            3   130084427    65042213
> > >>        8388608           17   137447697     8590481
> > >>        1048576          138   143392010     1046657
> > >>
> > >> It is as if there are some elements in the middle of the array, with
> > >> idx >= 143392010, that cannot be sent or received.
> > >>
> > >> Has anybody seen this kind of behaviour?
> > >> Does anybody have an idea what could be causing this?
> > >> Any ideas how to get around this?
> > >> Anything that could help would be appreciated ... I have already spent
> > >> a huge amount of time on this and I am running out of ideas.
> > >>
> > >> Cheers,
> > >> Martin
> > >>
> > >> --
> > >> Martin Siegert
> > >> Simon Fraser University
> > >> Burnaby, British Columbia
> > >> Canada
> > >>
> > >> <myMPI_Allreduce.f90>
> > >
> > > --
> > > Jeff Squyres
> > > jsquy...@cisco.com
> > > For corporate legal information go to:
> > > http://www.cisco.com/web/about/doing_business/legal/cri/
>
> --
> Martin Siegert
> Head, Research Computing
> WestGrid/ComputeCanada Site Lead
> IT Services                         phone: 778 782-4691
> Simon Fraser University             fax:   778 782-4242
> Burnaby, British Columbia           email: sieg...@sfu.ca
> Canada V5A 1S6
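For readers who do not have the attached myMPI_Allreduce.f90, here is a
minimal sketch of the chunking approach Martin describes above; the routine
and variable names follow the thread, but the chunk-size arithmetic and the
handling of the last (remainder) chunk are guesses, not Martin's actual code:

    subroutine myMPI_Allreduce(recvbuf, count, datatype, op, comm, mpierr)
      ! Sketch only: assumes everything (including Open MPI) is built with -i8,
      ! so default integers are 64-bit, as in the thread. recvbuf is typed for
      ! the MPI_DOUBLE_PRECISION / MPI_SUM use case described above.
      use mpi
      implicit none
      double precision, intent(inout) :: recvbuf(*)
      integer, intent(in)  :: count, datatype, op, comm
      integer, intent(out) :: mpierr
      integer, parameter :: maxallreduce = 268435456   ! largest chunk; tunable
      integer :: nallreduce, length, idx, iallreduce

      if (count <= maxallreduce) then
         call MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, datatype, op, comm, mpierr)
         return
      end if

      nallreduce = (count + maxallreduce - 1)/maxallreduce   ! number of chunks (guess)
      length     = (count + nallreduce - 1)/nallreduce       ! elements per chunk (guess)

      do iallreduce = 1, nallreduce - 1
         idx = (iallreduce - 1)*length + 1
         call MPI_Allreduce(MPI_IN_PLACE, recvbuf(idx), length, &
                            datatype, op, comm, mpierr)
         if (mpierr /= MPI_SUCCESS) return
      end do

      ! Last chunk takes whatever is left over.
      idx = (nallreduce - 1)*length + 1
      call MPI_Allreduce(MPI_IN_PLACE, recvbuf(idx), count - idx + 1, &
                         datatype, op, comm, mpierr)
    end subroutine myMPI_Allreduce

With count = 455295488 and maxallreduce = 268435456 this sketch gives
nallreduce = 2 and length = 227647744, consistent with the first row of the
table above, although the real attached routine may compute the chunks
differently.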