Hello, I am attempting to use the Open MPI development master with a code that uses dynamic process management (i.e., MPI_Comm_spawn) on our Cray XC40 at the Army Research Laboratory. After reading through the mailing list, I concluded that the master branch is the only hope for getting this to work on the newer Cray machines. To test, I am using the cpi-master.c / cpi-worker.c example.
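For anyone who has not seen it, the master side of that example boils down to a single MPI_Comm_spawn call. The sketch below is mine, not the exact example source; the worker count, executable name, and error-code handling are placeholders:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm workers;
        int errcodes[4];                      /* placeholder size, matches maxprocs */

        MPI_Init(&argc, &argv);
        /* Spawn four workers from rank 0; the real example then computes
         * pi cooperatively over the resulting intercommunicator. */
        MPI_Comm_spawn("./cpi-worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &workers, errcodes);
        /* ... master/worker exchange over the intercomm ... */
        MPI_Comm_disconnect(&workers);
        MPI_Finalize();
        return 0;
    }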
The test works when executing on a small number of processors (five or fewer) but begins to fail with segmentation faults in orted when using more. Even with five or fewer processors, I am spreading the computation across more than one node. I am using the Cray ugni BTL through the ALPS scheduler.

I get a core file from orted and have tracked the segfault down to pmix_server_process_msgs.c:420, where req->proxy is NULL. I have tried reading the code to understand how this happens, but am unsure. I do see that in the if statement where I take the else branch, the other branch explicitly checks "if (NULL == req->proxy)"; no such check is done in the else branch, as sketched below.

I have debug output dumped from the failing runs and can provide it, along with ompi_info output and config.log, to anyone who is interested.
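To make the asymmetry concrete, here is a trivial stand-in. This is entirely my own hypothetical code; the real struct and function in pmix_server_process_msgs.c are different, but this is the shape of the guard I would expect in both branches:

    #include <stdio.h>

    /* Hypothetical stand-in for the request structure; the real
     * type in the PMIx server code is different. */
    typedef struct {
        void *proxy;
    } req_t;

    static int process_msg(req_t *req, int first_branch)
    {
        if (first_branch) {
            if (NULL == req->proxy) {
                return -1;      /* a guard like this exists in the real code */
            }
            /* ... dereference req->proxy ... */
        } else {
            /* Without a matching guard here, a NULL proxy would be
             * dereferenced, which is consistent with the orted crash. */
            if (NULL == req->proxy) {
                return -1;
            }
            /* ... dereference req->proxy ... */
        }
        return 0;
    }

    int main(void)
    {
        req_t req = { NULL };
        /* Taking the else branch with a NULL proxy now fails
         * cleanly instead of segfaulting. */
        printf("result: %d\n", process_msg(&req, 0));
        return 0;
    }

- Ken Leiter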