Hi folks:

I am running into a strange problem with Open MPI 1.2.6, built using gcc/g++ and Intel ifort 10.1.015, atop an OFED stack (1.1-ish). If I run using only the tcp btl, with sm and openib disabled, the run completes successfully and repeatably (on several different platforms).

Conversely, if I enable either the openib or the sm btl, the run does not complete; it hangs, at different places.

An strace of the master thread while it is hanging shows it in a tight loop:

Process 15547 attached - interrupt to quit
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0

The code ran fine about 18 months ago with an earlier Open MPI release. The source and data are identical to what is known to work, and this has been demonstrated to work on a few different platforms.

When I posed the question on the Beowulf list, some people suggested turning off sm and openib, and the run does complete repeatedly when I do so. The suggestion was that there is some sort of buffer size issue on the sm device.

Turning off sm and tcp, leaving only openib, also appears to loop forever.
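
For reference, BTL selection of this kind is normally expressed through the btl MCA parameter on the mpirun command line. A rough sketch of the cases described above follows; the process count and the binary name (./my_app) are placeholders rather than the actual job, and including the self loopback BTL in each list is an assumption based on usual practice:

```shell
# Placeholders: -np 4 and ./my_app are hypothetical, not the real job.

# tcp only (sm and openib disabled): the case that completes
mpirun -np 4 --mca btl tcp,self ./my_app

# Same selection written as an exclusion list
mpirun -np 4 --mca btl ^sm,openib ./my_app

# openib only (sm and tcp disabled): the case that appears to loop forever
mpirun -np 4 --mca btl openib,self ./my_app
```

The leading ^ turns the whole comma-separated list into an exclusion list, so the first two commands should select the same transports on a machine that has only tcp, sm and openib available.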

So, with all this, are there any sort of tunables that I should be playing with?

I tried adjusting a few things by setting some MCA parameters in $HOME/.openmpi/mca-params.conf, but this had no effect (and mpirun claimed it was going to ignore those anyway).
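
To be concrete about the file format, a per-user parameter file takes one "name = value" pair per line. The fragment below is only a sketch of that format with the BTL selection as an example value, not the actual contents of my file:

```
# $HOME/.openmpi/mca-params.conf
# One "name = value" pair per line; lines starting with '#' are comments.
# Example value only: use the tcp and self BTLs, nothing else.
btl = tcp,self
```

The parameters that a given component accepts, along with their current values, can be listed with something like "ompi_info --param btl sm" (or openib in place of sm).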

Any clues?  Thanks.

Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: land...@scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615
