Maybe it's related to #1378, "PML ob1 deadlock for ping/ping"?

On 7/14/08, Jeff Squyres <jsquy...@cisco.com> wrote:
>
> What application is it?  The majority of the message passing engine did not
> change in the 1.2 series; we did add a new option in 1.2.6 for disabling
> early completion:
>
>
> http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion
>
> See if that helps you out.
>
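> If you want to experiment with it, ompi_info will list the exact parameter
> name given on that FAQ page, and the parameter can then be set on the
> mpirun command line.  A rough sketch (the grep pattern is only a guess, and
> <param-from-faq> is a placeholder, not a real option name):
>
>   ompi_info --param all all | grep -i early
>   mpirun --mca <param-from-faq> 0 -np 4 ./your_app
>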
> Note that I don't think many (any?) of us developers monitor the Beowulf
> list.  Too much mail in our INBOXes already... :-(
>
>
> On Jul 10, 2008, at 11:04 PM, Joe Landman wrote:
>
>  Hi folks:
>>
>>  I am running into a strange problem with Open MPI 1.2.6, built using
>> gcc/g++ and Intel ifort 10.1.015, atop an OFED stack (1.1-ish).  If I run
>> using only the tcp btl, disabling sm and openib, the run completes
>> successfully (on several different platforms), and does so repeatably.
>>
>>  Similarly, if I enable either openib or sm btl, the run does not
>> complete, hanging at different places.
>>
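>> Concretely, the only thing that changes between a good and a bad run is
>> the btl list passed to mpirun, roughly along these lines (process count
>> and binary name are placeholders):
>>
>>   mpirun --mca btl tcp,self       -np 16 ./app    # completes
>>   mpirun --mca btl sm,openib,self -np 16 ./app    # hangs
>>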
>>  An strace of the master thread while it is hanging shows it in a tight
>> loop:
>>
>> Process 15547 attached - interrupt to quit
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
>> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
>> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>>
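>> The loop above looks like the library just polling its event file
>> descriptors with a zero timeout; a backtrace would show more directly
>> which MPI call it is stuck in.  Attaching gdb to the spinning rank (pid
>> taken from the strace above) would be enough:
>>
>>   gdb -p 15547
>>   (gdb) bt        # shows the MPI call / progress loop it is spinning in
>>   (gdb) detach
>>   (gdb) quit
>>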
>> The code ran fine about 18 months ago with an earlier Open MPI.  The
>> source and data are identical to what is known to work, and have been
>> demonstrated to work on a few different platforms.
>>
>> When I posed the question on Beowulf, some suggested turning off sm and
>> openib.  The run works repeatedly when we do so.  The suggestion was that
>> there was some sort of buffer-size issue on the sm device.
>>
>> Turning off sm and tcp, leaving only openib, also appears to loop forever.
>>
>> So, with all this, are there any tunables that I should be playing
>> with?
>>
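>> For reference, each transport can be isolated in turn on the mpirun
>> command line, and its tunables listed with ompi_info (process count and
>> binary name below are placeholders):
>>
>>   mpirun --mca btl openib,self -np 16 ./app   # openib only
>>   mpirun --mca btl sm,self     -np 16 ./app   # sm only, single node
>>   ompi_info --param btl sm                    # list sm tunables
>>   ompi_info --param btl openib                # list openib tunables
>>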
>> I tried adjusting a few things by setting some MCA parameters in
>> $HOME/.openmpi/mca-params.conf, but this had no effect (and mpirun
>> claimed it was going to ignore those anyway).
>>
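>> For what it's worth, that file takes one "name = value" pair per line;
>> the parameter values below are only illustrative:
>>
>>   # $HOME/.openmpi/mca-params.conf
>>   btl = tcp,self
>>   btl_tcp_if_include = eth0
>>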
>> Any clues?  Thanks.
>>
>> Joe
>> --
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics LLC,
>> email: land...@scalableinformatics.com
>> web  : http://www.scalableinformatics.com
>>      http://jackrabbit.scalableinformatics.com
>> phone: +1 734 786 8423
>> fax  : +1 866 888 3112
>> cell : +1 734 612 4615
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>
