What application is it? The majority of the message-passing engine did not change in the 1.2 series, but we did add a new option in 1.2.6 for disabling early completion:

    http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion

See if that helps you out.
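For reference, that knob is an MCA parameter, so it can be set right on the mpirun command line. A rough sketch (the exact parameter name is documented on the FAQ page above; if memory serves it is pml_ob1_use_early_completion, and the application name and process count here are just placeholders):

    # disable early completion for the ob1 PML (parameter name per the FAQ above)
    mpirun --mca pml_ob1_use_early_completion 0 -np 16 ./your_app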

Note that I don't think many (any?) of us developers monitor the beowulf list. Too much mail in our INBOXes already... :-(


On Jul 10, 2008, at 11:04 PM, Joe Landman wrote:

Hi folks:

I am running into a strange problem with Open-MPI 1.2.6, built with gcc/g++ and Intel ifort 10.1.015, atop an OFED stack (1.1-ish). The problem appears to be that if I run using only the tcp btl, disabling sm and openib, the run completes successfully (on several different platforms), and does so repeatably.

Conversely, if I enable either the openib or sm btl, the run does not complete, hanging at different places.
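For concreteness, the runs are selected roughly like this (illustrative command lines only; the application name, input, and process count are placeholders):

    # completes repeatably: TCP only (plus the required self loopback)
    mpirun --mca btl tcp,self -np 8 ./app input.dat

    # hangs at varying points: sm and/or openib enabled
    mpirun --mca btl sm,openib,self -np 8 ./app input.dat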

An strace of the master thread while it is hanging shows it in a tight loop:

Process 15547 attached - interrupt to quit
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0

The code ran fine about 18 months ago with an earlier Open-MPI. This is identical source and data to what is known to work, and it has been demonstrated to work on a few different platforms.

When I posed the question on Beowulf, some suggested turning off sm and openib; the run does complete repeatably when we do so. The suggestion was that there is some sort of buffer-size issue on the sm device.

Turning off sm and tcp and leaving only openib also appears to loop forever.
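That case was invoked roughly as follows (again, illustrative only):

    # also hangs: openib only (plus self)
    mpirun --mca btl openib,self -np 8 ./app input.dat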

So, with all this, are there any tunables that I should be playing with?

I tried adjusting a few things by setting some MCA parameters in $HOME/.openmpi/mca-params.conf, but this had no effect (and mpirun claimed it was going to ignore them anyway).
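For what it is worth, that file is just one "name = value" pair per line; what I tried looked roughly like this (the specific parameters and values shown are only illustrative):

    # $HOME/.openmpi/mca-params.conf
    btl = tcp,self
    btl_sm_eager_limit = 4096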

Any clues?  Thanks.

Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: land...@scalableinformatics.com
web  : http://www.scalableinformatics.com
      http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


--
Jeff Squyres
Cisco Systems
