On Mar 20, 2009, at 4:21 AM, Cristian KLEIN wrote:

Jeff Squyres wrote:
I believe that this was just fixed in OMPI v1.3.1 -- could you try
upgrading?

Yup, the issue is well solved. :)

I would just like to add one thing. Isn't the current solution a little
bit error-prone? I mean, instead of having to check before each call to
ORTE_UPDATE_EXIT_STATUS whether the low 8 bits are indeed non-zero,
wouldn't it be wiser to have ORTE_UPDATE_EXIT_STATUS do the check?

The reason is that we often set the exit status with a value that doesn't come from a process termination, but rather from some internal error return. In those cases the usual OS-specific macros for testing abnormal termination don't apply, so the test cannot go inside the ORTE_UPDATE_EXIT_STATUS code.
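To make the distinction concrete, here is a minimal C sketch, assuming a POSIX system. It is not Open MPI code: wait_for_child_exit and decode_wait_status are hypothetical helper names, and mapping a signal to 128 + signal number is only the common shell convention. The point is that WIFEXITED/WEXITSTATUS are meaningful only for a value filled in by wait() or waitpid(); an internal error return is just an int and must not be run through them.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code` and store the
 * raw wait status in *status_out. Returns 0 on success, -1 on failure. */
static int wait_for_child_exit(int code, int *status_out)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                 /* child: terminate immediately */
    if (waitpid(pid, status_out, 0) < 0)
        return -1;
    return 0;
}

/* Hypothetical helper: turn a genuine wait status into an exit code.
 * Only valid for values produced by wait()/waitpid(). */
static int decode_wait_status(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);  /* normal exit: the child's exit code */
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status); /* shell-style convention, an assumption */
    return 1;                        /* anything else: report generic failure */
}
```

A child that exits with 255 decodes back to 255 here, even though the raw status word itself may have its low 8 bits clear.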





On Mar 19, 2009, at 10:58 AM, Cristian KLEIN wrote:

Hello everybody,

I've been using OpenMPI 1.3's mpirun in Makefiles and observed that the
exit status is not always the one I expect. For example, using an
incorrect machinefile makes mpirun return 0, whereas a non-zero value
would be expected:

--- cut here ---
masternode:~/grid/myTests/hellompi$ env | grep OMPI
OMPI_MCA_plm_rsh_agent=ssh
OMPI_MCA_btl_tcp_if_exclude=lo,myri0
OMPI_MCA_btl=self,tcp

masternode:~/grid/myTests/hellompi$ mpirun.openmpi -machinefile hostfile ./hellompi.openmpi; echo $?
ssh: incorrecthost2.example.com: Name or service not known
ssh: incorrecthost1.example.com: Name or service not known
[snip]
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

mpirun: clean termination accomplished

0
--- end here ---

The problem comes from the fact that the exit status of a process is
ANDed with 0xFF (only the low 8 bits are kept) and OpenMPI does not take
this into consideration. In my example, the exit status generated was
65280 (0xFF00), whose lower 8 bits are zero.
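The arithmetic can be checked with a small C sketch. The helper name shell_visible_status is hypothetical (it is not part of Open MPI); it only assumes the usual POSIX behaviour that the parent sees the low 8 bits of the value passed to exit():

```c
/* Hypothetical helper: the value a POSIX shell would report in $?
 * for a process that calls exit(status). Only the low 8 bits survive. */
static int shell_visible_status(int status)
{
    return status & 0xFF;
}
```

With this, shell_visible_status(65280) is 0, because 65280 is 0xFF00, while shell_visible_status(1) is still 1.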

To solve this problem I suggest the attached patch. If the exitstatus
would become zero, it replaces it with 111, where 111 is obviously a
randomly chosen non-zero number. :D
--- orte/runtime/orte_globals.h.orig 2009-01-09 18:55:22.000000000 +0100
+++ orte/runtime/orte_globals.h 2009-03-19 15:44:06.822708734 +0100
@@ -109,11 +109,14 @@
 #define ORTE_UPDATE_EXIT_STATUS(newstatus)                                  \
     do {                                                                    \
         if (0 == orte_exit_status && 0 != newstatus) {                      \
+            if ((newstatus & 0377) == 0)                                    \
+                orte_exit_status = 111;                                     \
+            else                                                            \
+                orte_exit_status = newstatus;                               \
             OPAL_OUTPUT_VERBOSE((1, orte_debug_output,                      \
                                  "%s:%s(%d) updating exit status to %d",    \
                                  ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),        \
-                                 __FILE__, __LINE__, newstatus));           \
-            orte_exit_status = newstatus;                                   \
+                                 __FILE__, __LINE__, orte_exit_status));    \
         }                                                                   \
     } while(0);
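
The same guard can also be written as a plain function, which makes the behaviour easier to check in isolation. This is only a sketch under stated assumptions: guarded_exit_status is a hypothetical name, not an Open MPI function, and 111 is the arbitrary fallback value proposed in this message.

```c
/* Hypothetical function form of the guard: returns the value that the
 * stored exit status should take after an update with `newstatus`. */
static int guarded_exit_status(int current, int newstatus)
{
    if (current != 0 || newstatus == 0)
        return current;        /* keep the first non-zero status */
    if ((newstatus & 0377) == 0)
        return 111;            /* low 8 bits are zero: force a non-zero code */
    return newstatus;
}
```

For example, guarded_exit_status(0, 65280) gives 111, while guarded_exit_status(0, 1) gives 1 and an already recorded status such as 5 is left alone.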

<ATT5772424.txt>



_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

