Jeff Squyres wrote:
> I believe that this was just fixed in OMPI v1.3.1 -- could you try
> upgrading?

Yup, the issue is well solved. :)

I would just like to add one thing: isn't the current solution a little
error-prone? Instead of having to check before each call to
ORTE_UPDATE_EXIT_STATUS whether the low 8 bits are indeed non-zero,
wouldn't it be wiser to have ORTE_UPDATE_EXIT_STATUS do the check itself?

> On Mar 19, 2009, at 10:58 AM, Cristian KLEIN wrote:
>
>> Hello everybody,
>>
>> I've been using OpenMPI 1.3's mpirun in Makefiles and observed that the
>> exit status is not always the one I expect. For example, using an
>> incorrect machinefile makes mpirun return 0, whereas a non-zero value
>> would be expected:
>>
>> --- cut here ---
>> masternode:~/grid/myTests/hellompi$ env | grep OMPI
>> OMPI_MCA_plm_rsh_agent=ssh
>> OMPI_MCA_btl_tcp_if_exclude=lo,myri0
>> OMPI_MCA_btl=self,tcp
>>
>> masternode:~/grid/myTests/hellompi$ mpirun.openmpi -machinefile hostfile
>> ./hellompi.openmpi; echo $?
>> ssh: incorrecthost2.example.com: Name or service not known
>> ssh: incorrecthost1.example.com: Name or service not known
>> [snip]
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>>
>> mpirun: clean termination accomplished
>>
>> 0
>> --- end here ---
>>
>> The problem comes from the fact that the exit status of a process is
>> ANDed with 0xFF (only the low 8 bits are kept) and OpenMPI does not
>> take this into consideration. In my example, the exit status generated
>> was 65280 (0xFF00), whose lower 8 bits are zero.
>>
>> To solve this problem I suggest the attached patch. If the exit status
>> would become zero, it replaces it with 111, where 111 is obviously a
>> randomly chosen non-zero number.
>> :D
>>
>> --- orte/runtime/orte_globals.h.orig 2009-01-09 18:55:22.000000000 +0100
>> +++ orte/runtime/orte_globals.h 2009-03-19 15:44:06.822708734 +0100
>> @@ -109,11 +109,14 @@
>>  #define ORTE_UPDATE_EXIT_STATUS(newstatus)                                 \
>>      do {                                                                   \
>>          if (0 == orte_exit_status && 0 != newstatus) {                     \
>> +            if ((newstatus & 0377) == 0)                                   \
>> +                orte_exit_status = 111;                                    \
>> +            else                                                           \
>> +                orte_exit_status = newstatus;                              \
>>              OPAL_OUTPUT_VERBOSE((1, orte_debug_output,                     \
>>                                   "%s:%s(%d) updating exit status to %d",   \
>>                                   ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),       \
>> -                                 __FILE__, __LINE__, newstatus));          \
>> -            orte_exit_status = newstatus;                                  \
>> +                                 __FILE__, __LINE__, orte_exit_status));   \
>>          }                                                                  \
>>      } while(0);
>>
>> <ATT5772424.txt>
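
To make the point about the low 8 bits concrete, here is a minimal
sketch in plain C of what "having the update do the check itself" could
look like. It is not Open MPI code: the function names and the
FALLBACK_EXIT_STATUS constant are hypothetical, and 111 is simply the
arbitrary non-zero value used in the quoted patch. The only fact it
relies on is that the parent sees just the low 8 bits of an exit status,
so 65280 (0xFF00) shows up as 0.

```c
#include <assert.h>

/* Hypothetical fallback; 111 is the arbitrary non-zero value from the patch. */
#define FALLBACK_EXIT_STATUS 111

/* Only the low 8 bits of an exit status are visible to the parent process. */
int visible_exit_status(int status)
{
    return status & 0xFF;
}

/* Hypothetical helper (not an Open MPI function): if a non-zero status
 * would be truncated to 0, substitute the fallback value instead. */
int sanitize_exit_status(int newstatus)
{
    if (newstatus != 0 && visible_exit_status(newstatus) == 0) {
        return FALLBACK_EXIT_STATUS;
    }
    return newstatus;
}

/* Same rule as the macro in the patch: the first non-zero status wins,
 * but the truncation check lives here rather than at every call site. */
int update_exit_status(int current, int newstatus)
{
    if (current == 0 && newstatus != 0) {
        return sanitize_exit_status(newstatus);
    }
    return current;
}

/* Small self-check covering the case from the report. */
void check_exit_status_examples(void)
{
    assert(visible_exit_status(65280) == 0);      /* 0xFF00: low byte is 0 */
    assert(update_exit_status(0, 65280) == 111);  /* replaced by fallback  */
    assert(update_exit_status(0, 1) == 1);        /* ordinary failure kept */
    assert(update_exit_status(1, 65280) == 1);    /* first failure wins    */
    assert(update_exit_status(0, 0) == 0);        /* success stays success */
}
```

With the check centralised this way, callers would just pass whatever
status they have and never need to remember the 8-bit rule themselves.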