On May 26, 2008, at 5:17 PM, Matt Hughes wrote:

With the TCP btl, when free list items are exhausted, OMPI 1.2.6 falls
into an infinite loop:

#3981 0x0000002a98b4e23f in opal_condition_wait (c=0x2a98c541d0,
   m=0x2a98c54180) at ../../../../opal/threads/condition.h:81

[snip]

Yoinks.

The call used to get a free list item is OMPI_FREE_LIST_WAIT(), which
is supposed to block until an item is available.  However, it calls
opal_condition_wait(), which in turn calls opal_progress(), which then
waits for a free list item...  It seems strange to me that
opal_condition_wait() calls opal_progress(), but I'm not that familiar
with the code.

We do that because OMPI is single-threaded. Otherwise, there would be no way to make progress while waiting for the condition variable to be signaled.
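
To make the pattern concrete, here is a toy sketch in C. It is not Open MPI's actual code: every name in it (toy_wait_for_item, toy_progress, handler_needs_item, MAX_DEPTH) is made up for illustration, and the depth cap exists only so the demo terminates. It assumes a single-threaded waiter that must drive progress itself, and it shows how the wait nests if the progress handler also asks for a free list item.

```c
/* Toy model only. None of these names are Open MPI's real API. */

#define MAX_DEPTH 8   /* hypothetical cap so the demo stops; the real case has none */

static int free_items = 0;          /* items currently on the toy free list */
static int depth = 0;               /* current nesting level of the wait */
static int max_depth_seen = 0;      /* deepest nesting reached so far */
static int handler_needs_item = 0;  /* 1: the progress handler requests an item too */

static int toy_wait_for_item(void);

/* Stand-in for a progress function. Returns 0 on success, -1 if it gave up. */
static int toy_progress(void)
{
    if (handler_needs_item) {
        /* The handler needs its own item, so it re-enters the wait. */
        if (toy_wait_for_item() != 0) {
            return -1;
        }
        free_items++;   /* handler finished and gives its item back */
        return 0;
    }
    free_items++;       /* an outstanding operation completed and released an item */
    return 0;
}

/* Stand-in for a blocking "get item" call in a single-threaded build:
 * no other thread can refill the list, so the waiter calls progress itself. */
static int toy_wait_for_item(void)
{
    int rc = 0;

    depth++;
    if (depth > max_depth_seen) {
        max_depth_seen = depth;
    }
    while (free_items == 0) {
        if (depth >= MAX_DEPTH || toy_progress() != 0) {
            rc = -1;    /* give up instead of nesting forever */
            break;
        }
    }
    if (rc == 0) {
        free_items--;   /* take the item */
    }
    depth--;
    return rc;
}
```

With handler_needs_item set to 0, one call to toy_progress() refills the list and the wait returns at depth 1. With it set to 1, each wait calls progress, which waits again, and the nesting only stops at the artificial cap. That is the general shape a very deep backtrace of wait and progress frames would have, though I have not confirmed that this is what is happening in your case.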

Is it possible that this has been fixed in 1.3?

It is possible -- there were some changes with regard to how free list waiting is done. Would it be possible to try your test with a trunk nightly tarball?

    http://www.open-mpi.org/nightly/trunk/

I haven't tried 1.3 yet because I will have to file a truckload of
bugs against 1.3 first.

Do you have a truckload of bugs to file for v1.3? If so, now is the time to do so -- we're gearing up for the v1.3 release...

Should I be posting this stuff to the devel list?


If your questions go beyond naive user-level questions, you might get a quicker response on the devel list.

--
Jeff Squyres
Cisco Systems
