Replying to my own post, I'd like to add some info:

After making the master thread give higher priority to receiving the
missing messages, the problem went away. Both tasks now appear to keep
up with the messages sent by the other. However, after about a minute
and ~1.5e6 messages exchanged, both tasks segfault after printing the
following error:

[sunrise01.rc.fas.harvard.edu:10009] mca_btl_sm_component_progress read an unknown type of header

The debugger drops me at line 674 of btl_sm_component.c, in the
default case of a switch on the fragment type. There's a comment there
that says:

* This code path should presumably never be called.
* It's unclear if it should exist or, if so, how it should be written.
* If we want to return it to the sending process,
* we have to figure out who the sender is.
* It seems we need to subtract the mask bits.
* Then, hopefully this is an sm header that has an smp_rank field.
* Presumably that means the received header was relative.
* Or, maybe this code should just be removed.

That seems worrisome, like whoever wrote the code didn't know what was
going on... I've gotten that error previously, but only when millions
of outstanding messages had built up. Now, that's not the case.
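
To make the backlog point concrete, here is a toy model in plain C. It
is not my actual code: inbox_t, deliver, receive_some and simulate are
made-up names, and a simple counter stands in for the MPI calls (in a
real program the pending messages would be detected with MPI_Iprobe and
drained with MPI_Recv). It only shows why receiving everything that is
pending on each pass keeps the number of outstanding messages bounded,
while receiving one message per pass lets it grow without limit.

```c
#include <assert.h>

/* Toy stand-in for the messages queued up for one task. */
typedef struct {
    long pending;      /* messages sent to us but not yet received */
    long max_pending;  /* high-water mark of the backlog */
} inbox_t;

/* The peer sends us n messages. */
static void deliver(inbox_t *in, long n)
{
    assert(n >= 0);
    in->pending += n;
    if (in->pending > in->max_pending)
        in->max_pending = in->pending;
}

/* Receive at most `limit` pending messages; a negative limit means
 * "drain everything". Returns the number of messages received. */
static long receive_some(inbox_t *in, long limit)
{
    long n = in->pending;
    if (limit >= 0 && n > limit)
        n = limit;
    in->pending -= n;
    return n;
}

/* Run `iters` passes of the main loop while the peer produces `rate`
 * messages per pass. Returns the largest backlog that was observed. */
static long simulate(long iters, long rate, long recv_limit)
{
    inbox_t in = {0, 0};
    for (long i = 0; i < iters; i++) {
        deliver(&in, rate);
        receive_some(&in, recv_limit);
        /* ... the task would send its own message here ... */
    }
    return in.max_pending;
}
```

With two incoming messages per pass, simulate(1000, 2, 1) ends with a
backlog of about a thousand messages, whereas simulate(1000, 2, -1)
never has more than two outstanding. The second case is the situation
I'm in now, which is why the error surprises me.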

Does anyone have any idea what could be going on here?

Thanks,

/Patrik J.
