Replying to my own post to add some info: after making the master thread give higher priority to receiving the missing messages, the problem went away. Both tasks now appear to keep up with the messages sent by the other. However, after about a minute and ~1.5e6 messages exchanged, both tasks segfault after printing the following error:

[sunrise01.rc.fas.harvard.edu:10009] mca_btl_sm_component_progress read an unknown type of header

The debugger spits me out on line 674 of btl_sm_component.c, in the default case of a switch on fragment type. There's a comment there that says:

 * This code path should presumably never be called.
 * It's unclear if it should exist or, if so, how it should be written.
 * If we want to return it to the sending process,
 * we have to figure out who the sender is.
 * It seems we need to subtract the mask bits.
 * Then, hopefully this is an sm header that has an smp_rank field.
 * Presumably that means the received header was relative.
 * Or, maybe this code should just be removed.

That seems worrisome, like whoever wrote the code didn't know what was going on...

I've gotten that error previously, but only when millions of outstanding messages had built up. Now, that's not the case. Does anyone have any idea what could be going on here?

Thanks,

/Patrik J.