On May 26, 2008, at 5:17 PM, Matt Hughes wrote:
With the TCP btl, when free list items are exhausted, OMPI 1.2.6 falls
into an infinite loop:
#3981 0x0000002a98b4e23f in opal_condition_wait (c=0x2a98c541d0,
m=0x2a98c54180) at ../../../../opal/threads/condition.h:81
[snip]
Yoinks.
The call used to get a free list item is OMPI_FREE_LIST_WAIT(), which
is supposed to block until an item is available. However, it calls
opal_condition_wait(), which in turn calls opal_progress(), which then
waits for a free list item... It seems strange to me that
opal_condition_wait() calls opal_progress(), but I'm not that familiar
with the code.
We do that because OMPI is single-threaded. Otherwise, there's no
other way to make progress while waiting for the condition variable
to be signaled.
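To make the cycle described above concrete, here is a minimal C sketch.
It is a toy model under stated assumptions, not the Open MPI source:
every toy_ name is hypothetical, and the depth cap and abort flag exist
only so the demo terminates (the reported loop has no such way out).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these are the real Open MPI functions. */

#define TOY_MAX_DEPTH 8            /* cap nesting so the demo terminates */

static int  toy_free_items = 0;    /* 0 models an exhausted free list */
static int  toy_depth = 0;         /* current nesting of toy_progress() */
static int  toy_max_depth_seen = 0;
static bool toy_aborted = false;   /* demo-only escape hatch */

static bool toy_free_list_wait(void);

/* Stand-in for a progress engine whose transport needs a free list
 * item before it can move any pending traffic. */
static void toy_progress(void)
{
    toy_depth++;
    if (toy_depth > toy_max_depth_seen) {
        toy_max_depth_seen = toy_depth;
    }
    if (toy_depth >= TOY_MAX_DEPTH) {
        toy_aborted = true;            /* stop the demo at the cap */
    } else {
        (void)toy_free_list_wait();    /* re-enters the wait path */
    }
    toy_depth--;
}

/* Stand-in for a condition wait in a single-threaded build: no other
 * thread can signal the condition, so the waiter drives progress. */
static void toy_condition_wait(void)
{
    toy_progress();
}

/* Stand-in for a blocking free list get. Returns true if an item was
 * obtained, false if the demo gave up. */
static bool toy_free_list_wait(void)
{
    while (toy_free_items == 0) {
        if (toy_aborted) {
            return false;
        }
        toy_condition_wait();
    }
    toy_free_items--;
    return true;
}

/* Runs one request against a list holding `items` entries and returns
 * the deepest nesting of toy_progress() that was reached. */
static int toy_run(int items, bool *got_item)
{
    toy_free_items = items;
    toy_depth = 0;
    toy_max_depth_seen = 0;
    toy_aborted = false;
    *got_item = toy_free_list_wait();
    assert(toy_depth == 0);            /* every nested call has unwound */
    return toy_max_depth_seen;
}
```

Calling toy_run(0, &got) nests until it hits TOY_MAX_DEPTH and comes
back without an item, while toy_run(1, &got) returns right away without
ever entering toy_progress(). Whether the real TCP btl re-enters the
free list wait in exactly this way is an assumption based on the
description above.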
Is it possible that this has been fixed in 1.3?
It is possible -- there were some changes with regard to how free
list waiting was done, etc. Would it be possible to try your test
with a trunk nightly tarball?
http://www.open-mpi.org/nightly/trunk/
I haven't tried 1.3 yet because I will have to file a truckload of
bugs against 1.3 first.
Do you have a truckload of bugs to file for v1.3? If so, now is the
time to do so -- we're gearing up for the v1.3 release...
Should I be posting this stuff to the devel list?
If your questions go beyond the naive-user-level questions, you might
get a quicker response on the devel list.
--
Jeff Squyres
Cisco Systems