With the TCP BTL, when free list items are exhausted, OMPI 1.2.6 falls
into what looks like infinite recursion. The backtrace is thousands of
frames deep and keeps repeating the same sequence:

#3981 0x0000002a98b4e23f in opal_condition_wait (c=0x2a98c541d0,
    m=0x2a98c54180) at ../../../../opal/threads/condition.h:81
#3982 0x0000002a98b4e0e9 in __ompi_free_list_wait (fl=0x2a98c540d0,
    item=0x7fa82af630) at ../../../../ompi/class/ompi_free_list.h:187
#3983 0x0000002a98b4dbd4 in mca_btl_tcp_endpoint_recv_handler (sd=18, flags=2,
    user=0xc20240) at btl_tcp_endpoint.c:611
#3984 0x0000002a95bf78de in opal_event_process_active (base=0xb81390)
    at event.c:464
#3985 0x0000002a95bf7c0a in opal_event_base_loop (base=0xb81390, flags=2)
    at event.c:603
#3986 0x0000002a95bf79c7 in opal_event_loop (flags=2) at event.c:517
#3987 0x0000002a95bf2227 in opal_progress () at runtime/opal_progress.c:259
#3988 0x0000002a98b4e23f in opal_condition_wait (c=0x2a98c541d0,
    m=0x2a98c54180) at ../../../../opal/threads/condition.h:81
#3989 0x0000002a98b4e0e9 in __ompi_free_list_wait (fl=0x2a98c540d0,
    item=0x7fa82af7f0) at ../../../../ompi/class/ompi_free_list.h:187
#3990 0x0000002a98b4dbd4 in mca_btl_tcp_endpoint_recv_handler (sd=22, flags=2,
    user=0xc2dcf0) at btl_tcp_endpoint.c:611
#3991 0x0000002a95bf78de in opal_event_process_active (base=0xb81390)
    at event.c:464
#3992 0x0000002a95bf7c0a in opal_event_base_loop (base=0xb81390, flags=2)
    at event.c:603
#3993 0x0000002a95bf79c7 in opal_event_loop (flags=2) at event.c:517
#3994 0x0000002a95bf2227 in opal_progress () at runtime/opal_progress.c:259
#3995 0x0000002a98b4e23f in opal_condition_wait (c=0x2a98c541d0,
    m=0x2a98c54180) at ../../../../opal/threads/condition.h:81

The call used to get a free list item is OMPI_FREE_LIST_WAIT(), which
is supposed to block until an item is available.  However, it calls
opal_condition_wait(), which in turn calls opal_progress(), which runs
the event loop and re-enters mca_btl_tcp_endpoint_recv_handler(), which
then waits for a free list item again, and so on.  It seems strange to
me that opal_condition_wait() calls opal_progress(), but I'm not that
familiar with the code.
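
To make the cycle concrete, here is a minimal sketch of the re-entrancy
pattern the backtrace suggests.  It is not Open MPI code: every name
(sketch_free_list_wait, sketch_progress, sketch_recv_handler,
sketch_run) is hypothetical, and the MAX_DEPTH cap and the aborted flag
exist only so the sketch terminates, which the real call chain
apparently does not.

```c
#include <assert.h>

/* Hypothetical stand-ins; none of these are the real Open MPI functions. */
enum { MAX_DEPTH = 8 };        /* artificial cap so the sketch terminates */

static int free_items;         /* items left on the modelled free list */
static int depth;              /* current nesting of receive handlers */
static int max_depth_seen;     /* deepest nesting reached in this run */
static int aborted;            /* set once the artificial cap is hit */

static void sketch_progress(void);

/* Models a blocking free-list wait that drives progress while it waits. */
static int sketch_free_list_wait(void)
{
    while (free_items == 0) {
        if (aborted || depth >= MAX_DEPTH) {
            aborted = 1;       /* assumption: no such guard in the real code */
            return -1;
        }
        sketch_progress();     /* re-enters the event loop */
    }
    free_items--;
    return 0;
}

/* Models a receive handler that needs a free-list item for each message. */
static int sketch_recv_handler(void)
{
    int rc;

    depth++;
    if (depth > max_depth_seen) {
        max_depth_seen = depth;
    }
    rc = sketch_free_list_wait();
    depth--;
    return rc;
}

/* Models the progress engine: another socket is readable, so dispatch it. */
static void sketch_progress(void)
{
    (void)sketch_recv_handler();
}

/* Runs one simulated receive and returns the deepest nesting reached. */
int sketch_run(int initial_items)
{
    free_items = initial_items;
    depth = 0;
    max_depth_seen = 0;
    aborted = 0;
    (void)sketch_recv_handler();
    assert(depth == 0);        /* every nested handler has unwound */
    return max_depth_seen;
}
```

With an empty list, sketch_run(0) nests until it reaches the cap; with
one item available, sketch_run(1) returns after a single handler call.
Nothing inside the nested handlers ever returns an item to the list, so
without the cap the nesting would only stop when the stack runs out.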

Is it possible that this has been fixed in 1.3?

I haven't tried 1.3 yet because I will have to file a truckload of
bugs against 1.3 first.

Should I be posting this stuff to the devel list?

Thanks,
mch
