What application is it? The majority of the message passing engine
did not change in the 1.2 series; we did, however, add a new option in
1.2.6 for disabling early completion:
http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion
See if that helps you out.
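For reference, disabling it amounts to setting a single MCA parameter,
e.g. on the mpirun command line, something like the following (I'm
writing the parameter name from memory, so treat the FAQ entry above as
authoritative; the app name and process count are just placeholders):

  mpirun -np 8 --mca pml_ob1_use_early_completion 0 ./my_app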
Note that I don't think many (any?) of us developers monitor the
beowulf list. Too much mail in our INBOXes already... :-(
On Jul 10, 2008, at 11:04 PM, Joe Landman wrote:
Hi folks:
I am running into a strange problem with Open MPI 1.2.6, built
using gcc/g++ and Intel ifort 10.1.015, atop an OFED stack (1.1-
ish). If I run using only the tcp btl, disabling sm and openib, the
run completes successfully and repeatably on several different
platforms.
Conversely, if I enable either the openib or the sm btl, the run does
not complete, hanging at a different place each time.
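For concreteness, the two cases correspond to invocations along these
lines (the binary name and process count are placeholders):

  mpirun -np 8 --mca btl tcp,self ./my_app         # tcp only: completes
  mpirun -np 8 --mca btl openib,sm,self ./my_app   # sm/openib on: hangs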
An strace of the master thread while it is hanging shows it spinning
in a tight loop:
Process 15547 attached - interrupt to quit
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
The code ran fine about 18 months ago with an earlier Open MPI. The
source and data are identical to what is known to work, and have been
demonstrated to work on a few different platforms.
When I posed the question on the Beowulf list, some people suggested
turning off sm and openib. The run does indeed complete repeatedly
when we do that. The suggestion was that there might be some sort of
buffer-size issue in the sm device.
Turning off sm and tcp and leaving only openib also appears to loop
forever.
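That case corresponds to an invocation roughly like this (again, the
binary name is a placeholder):

  mpirun -np 8 --mca btl openib,self ./my_app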
So, with all this, are there any tunables that I should be playing
with?
I tried adjusting a few things by setting some mca parameters in
$HOME/.openmpi/mca-params.conf, but this had no effect (and mpirun
claimed it was going to ignore those anyway).
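For what it's worth, the file format is simply one "parameter = value"
per line; what I put in it was along these lines (the parameter names
and values below are representative examples only, not the exact
settings I tried):

  # $HOME/.openmpi/mca-params.conf (illustrative values only)
  btl = tcp,self
  btl_sm_eager_limit = 8192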
Any clues? Thanks.
Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: land...@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615
--
Jeff Squyres
Cisco Systems