Hello everybody,
I've been using OpenMPI 1.3's mpirun in Makefiles and observed that the
exit status is not always the one I expect. For example, using an
incorrect machinefile makes mpirun return 0, whereas a non-zero value
would be expected:
--- cut here ---
masternode:~/grid/myTests/hellompi$ env | grep OMPI
OMPI_MCA_plm_rsh_agent=ssh
OMPI_MCA_btl_tcp_if_exclude=lo,myri0
OMPI_MCA_btl=self,tcp
masternode:~/grid/myTests/hellompi$ mpirun.openmpi -machinefile hostfile ./hellompi.openmpi; echo $?
ssh: incorrecthost2.example.com: Name or service not known
ssh: incorrecthost1.example.com: Name or service not known
[snip]
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
0
--- end here ---
The problem comes from the fact that only the low 8 bits of a process's
exit status reach the parent (the status is ANDed with 0xFF), and OpenMPI
does not take this into consideration. In my example, the exit status
generated was 65280 (0xFF00), whose lower 8 bits are zero.
To solve this problem I suggest the attached patch. If the low 8 bits of
the new exit status are zero, it replaces the status with 111, where 111
is obviously an arbitrarily chosen non-zero number. :D
--- orte/runtime/orte_globals.h.orig	2009-01-09 18:55:22.000000000 +0100
+++ orte/runtime/orte_globals.h	2009-03-19 15:44:06.822708734 +0100
@@ -109,11 +109,14 @@
 #define ORTE_UPDATE_EXIT_STATUS(newstatus)                                  \
     do {                                                                    \
         if (0 == orte_exit_status && 0 != newstatus) {                      \
+            if ((newstatus & 0377) == 0)                                    \
+                orte_exit_status = 111;                                     \
+            else                                                            \
+                orte_exit_status = newstatus;                               \
             OPAL_OUTPUT_VERBOSE((1, orte_debug_output,                      \
                                  "%s:%s(%d) updating exit status to %d",    \
                                  ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),        \
-                                 __FILE__, __LINE__, newstatus));           \
-            orte_exit_status = newstatus;                                   \
+                                 __FILE__, __LINE__, orte_exit_status));    \
         }                                                                   \
     } while(0);