Hello Joe,

I have no solution, but I have the same problem; see
http://www.open-mpi.org/community/lists/users/2008/07/6007.php
There you will find a small program that demonstrates the problem.

I found that the problem does not exist on all hardware; I have the impression that it manifests itself on systems with two or more cores. I tried it on a single-core machine, and there was no problem.

Regards,

Willem

Joe Landman wrote:
Hi folks:

   I am running into a strange problem with Open MPI 1.2.6, built using
gcc/g++ and Intel ifort 10.1.015, atop an OFED stack (1.1-ish).  If I run
using only the tcp btl, disabling sm and openib, the run completes
successfully (on several different platforms), and does so repeatably.

   Conversely, if I enable either the openib or the sm btl, the run does
not complete, hanging at different places.
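
For reference, the invocations look roughly like this (the process count
and binary name here are placeholders, not the actual job):

  mpirun --mca btl tcp,self -np 8 ./my_app     # tcp only: completes repeatably
  mpirun --mca btl sm,tcp,self -np 8 ./my_app  # sm enabled: hangs at varying points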

   An strace of the master thread while it is hanging shows it in a
tight loop:

Process 15547 attached - interrupt to quit
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0

The code ran fine about 18 months ago with an earlier Open MPI release.
This is identical source and data to what is known to work, and it has
been demonstrated to work on a few different platforms.

When I posed the question on the Beowulf list, some suggested turning off
sm and openib.  The run does indeed complete repeatedly when we do so.
The suggestion was that there is some sort of buffer-size issue on the
sm device.
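
If it really is a buffer-size issue on the sm btl, the relevant knobs
would be the sm MCA parameters; a sketch (the parameter list for a given
build can be checked with ompi_info, and the value below is only
illustrative, not a recommendation):

  ompi_info --param btl sm
  mpirun --mca btl sm,self --mca btl_sm_eager_limit 8192 -np 8 ./my_app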

Turning off sm and tcp, leaving only openib, also appears to loop forever.
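
That is, an openib-only selection along these lines (same placeholders as
above) also hangs:

  mpirun --mca btl openib,self -np 8 ./my_app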

So, given all this, are there any tunables that I should be playing
with?

I tried adjusting a few things by setting some MCA parameters in
$HOME/.openmpi/mca-params.conf, but this had no effect (and mpirun
claimed it was going to ignore those anyway).
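
For completeness, the file contained plain "name = value" lines, e.g.
(the entry shown here is only an example, not what I actually set):

  # $HOME/.openmpi/mca-params.conf
  btl = tcp,self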

Any clues?  Thanks.

Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: land...@scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615

--
Willem Vermin         tel (31)20 5923054/5923000
SARA, Kruislaan 415   fax (31)20 6683167
1098 SJ Amsterdam     wil...@sara.nl
Nederland
