Maybe it's related to #1378, the PML ob1 deadlock for ping/ping?
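
In case it helps to have the exact invocations for comparison, this is roughly how I would run the two configurations Joe describes below, plus the early-completion workaround from the FAQ Jeff linked. The btl selections are standard; I'm quoting the early-completion parameter name from memory, so please verify it with ompi_info before relying on it:

  # configuration Joe reports as working: TCP only, sm and openib disabled
  mpirun --mca btl tcp,self -np 16 ./app

  # configurations that hang: shared memory and/or InfiniBand enabled
  mpirun --mca btl openib,sm,self -np 16 ./app
  mpirun --mca btl openib,self -np 16 ./app

  # FAQ workaround for 1.2.6: disable early completion
  # (parameter name from memory -- check with "ompi_info --param pml ob1")
  mpirun --mca pml_ob1_use_early_completion 0 --mca btl openib,sm,self -np 16 ./app

(-np 16 and ./app are just placeholders for the real job.)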
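
On the mca-params.conf question: the file is just one "name = value" pair per line, for example

  # $HOME/.openmpi/mca-params.conf
  btl = tcp,self
  btl_tcp_if_include = eth0

(these two parameters are only examples of the format, not a recommendation for this particular problem). Values given on the mpirun command line, or via OMPI_MCA_* environment variables, take precedence over the file; one possible reason for the "going to ignore those" message is that the same parameters were also set at one of those higher-precedence levels.
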
On 7/14/08, Jeff Squyres <jsquy...@cisco.com> wrote:
>
> What application is it? The majority of the message passing engine did not
> change in the 1.2 series; we did add a new option in 1.2.6 for disabling
> early completion:
>
>   http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion
>
> See if that helps you out.
>
> Note that I don't think many (any?) of us developers monitor the beowulf
> list. Too much mail in our INBOXes already... :-(
>
>
> On Jul 10, 2008, at 11:04 PM, Joe Landman wrote:
>
>> Hi folks:
>>
>> I am running into a strange problem with Open-MPI 1.2.6, built using
>> gcc/g++ and intel ifort 10.1.015, atop an OFED stack (1.1-ish). The
>> problem appears to be that if I run using the tcp btl, disabling sm and
>> openib, the run completes successfully (on several different platforms),
>> and does so repeatably.
>>
>> Similarly, if I enable either the openib or the sm btl, the run does not
>> complete, hanging at different places.
>>
>> An strace of the master thread while it is hanging shows it in a tight
>> loop:
>>
>> Process 15547 attached - interrupt to quit
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
>> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
>> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>>
>> The code ran fine about 18 months ago with an earlier Open-MPI release.
>> This is identical source and data to what is known to work, and it has
>> been demonstrated to work on a few different platforms.
>>
>> When I posed the question on Beowulf, some suggested turning off sm and
>> openib. The run does work repeatedly when we do that. The suggestion was
>> that there is some sort of buffer size issue on the sm device.
>>
>> Turning off sm and tcp, leaving openib, also appears to loop forever.
>>
>> So, with all this, are there any tunables that I should be playing with?
>>
>> I tried adjusting a few things by setting some mca parameters in
>> $HOME/.openmpi/mca-params.conf, but this had no effect (and mpirun
>> claimed it was going to ignore those anyway).
>>
>> Any clues? Thanks.
>>
>> Joe
>> --
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics LLC
>> email: land...@scalableinformatics.com
>> web  : http://www.scalableinformatics.com
>>        http://jackrabbit.scalableinformatics.com
>> phone: +1 734 786 8423
>> fax  : +1 866 888 3112
>> cell : +1 734 612 4615
>
>
> --
> Jeff Squyres
> Cisco Systems