Hello everybody, I've been using OpenMPI 1.3's mpirun in Makefiles and observed that the exit status is not always the one I expect. For example, using an incorrect machinefile makes mpirun return 0, whereas a non-zero value would be expected:
--- cut here ---
masternode:~/grid/myTests/hellompi$ env | grep OMPI
OMPI_MCA_plm_rsh_agent=ssh
OMPI_MCA_btl_tcp_if_exclude=lo,myri0
OMPI_MCA_btl=self,tcp
masternode:~/grid/myTests/hellompi$ mpirun.openmpi -machinefile hostfile ./hellompi.openmpi; echo $?
ssh: incorrecthost2.example.com: Name or service not known
ssh: incorrecthost1.example.com: Name or service not known
[snip]
mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

0
--- end here ---

The problem comes from the fact that the exit status of a process is ANDed with 0xFF (only the low 8 bits are reported to the parent), and OpenMPI does not take this into consideration. In my example, the exit status generated was 65280 (0xFF00), whose lower 8 bits are all zero, so the shell sees 0.

To solve this problem I suggest the attached patch. If the exit status would become zero after masking, the patch replaces it with 111, where 111 is simply an arbitrarily chosen non-zero number. :D
--- orte/runtime/orte_globals.h.orig 2009-01-09 18:55:22.000000000 +0100
+++ orte/runtime/orte_globals.h 2009-03-19 15:44:06.822708734 +0100
@@ -109,11 +109,14 @@
 #define ORTE_UPDATE_EXIT_STATUS(newstatus) \
     do { \
         if (0 == orte_exit_status && 0 != newstatus) { \
+            if ((newstatus & 0377) == 0) \
+                orte_exit_status = 111; \
+            else \
+                orte_exit_status = newstatus; \
             OPAL_OUTPUT_VERBOSE((1, orte_debug_output, \
                                  "%s:%s(%d) updating exit status to %d", \
                                  ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), \
-                                 __FILE__, __LINE__, newstatus)); \
-            orte_exit_status = newstatus; \
+                                 __FILE__, __LINE__, orte_exit_status)); \
         } \
     } while(0);
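
To make the arithmetic concrete, here is a small standalone C sketch. It is not Open MPI code: the names status_seen_by_shell, update_exit_status and demo are made up for illustration. update_exit_status only mirrors the logic of the patched macro, under the assumption that the stored value is later handed to exit() unchanged, so that only its low 8 bits reach the shell.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: models what a parent (e.g. the shell's $?) sees
 * when a process calls exit(status). Only the low 8 bits survive. */
int status_seen_by_shell(int status)
{
    return status & 0xFF;
}

/* Hypothetical helper mirroring the patched ORTE_UPDATE_EXIT_STATUS logic:
 * 'current' plays the role of orte_exit_status. The first non-zero status
 * wins, and a status whose low 8 bits are zero is replaced by 111. */
int update_exit_status(int current, int newstatus)
{
    if (current == 0 && newstatus != 0) {
        if ((newstatus & 0377) == 0)   /* 0377 octal == 0xFF */
            return 111;                /* arbitrary non-zero value, as in the patch */
        return newstatus;
    }
    return current;                    /* already set, or nothing to record */
}

/* Small self-check using the value from the post; call it from main(). */
void demo(void)
{
    int raw = 65280;                                   /* 0xFF00 */
    int patched = update_exit_status(0, raw);

    assert(status_seen_by_shell(raw) == 0);            /* the reported problem */
    assert(status_seen_by_shell(patched) == 111);      /* after the patch */

    printf("raw=%d shell=%d patched_shell=%d\n",
           raw, status_seen_by_shell(raw), status_seen_by_shell(patched));
}
```

Note that a status such as 65281 is stored as-is by this logic and would show up as 1 in the shell. The replacement by 111 only kicks in when masking would otherwise yield 0.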