Hi all,

This question was buried in an earlier question and got no replies, so I'll try reposting it with a more enticing subject.
I have a multithreaded Open MPI code where each task has N+1 threads: the N threads send nonblocking messages, which are received by the 1 remaining thread on the other tasks. When I run this code with 2 tasks and 5+1 threads each on a single node with 12 cores, the tasks segfault after about a million messages have been exchanged, right after printing the following error:

[sunrise01.rc.fas.harvard.edu:10009] mca_btl_sm_component_progress read an unknown type of header

The debugger drops me on line 674 of btl_sm_component.c, in the default case of a switch on fragment type. There's a comment there that says:

 * This code path should presumably never be called.
 * It's unclear if it should exist or, if so, how it should be written.
 * If we want to return it to the sending process,
 * we have to figure out who the sender is.
 * It seems we need to subtract the mask bits.
 * Then, hopefully this is an sm header that has an smp_rank field.
 * Presumably that means the received header was relative.
 * Or, maybe this code should just be removed.

It seems like whoever wrote that code didn't know quite what was going on, and I guess the assumption was wrong, because dereferencing that result seems to be what's causing the segfault.

Does anyone here know what could cause this error? If I run the code with the tcp btl instead of sm, it runs fine, albeit with somewhat lower performance.

This is with Open MPI 1.5.3 using MPI_THREAD_MULTIPLE on a Dell PowerEdge C6100 running Linux kernel 2.6.18-194.32.1.el5, using Intel 12.3.174. I've attached the ompi_info output.

Thanks,

/Patrik J.
ompi_info.gz
Description: GNU Zip compressed data