Dear colleagues,

I have a parallel MPI application written in C that works normally in a serial version, and in the parallel version in the sense that all numerical output is correct. When it tries to shut down, it gives the following console error message:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[51524,1],0]
  Exit code:    13
-----End quoted console text-----

The Process name given does not correspond to any Linux process ID. The Exit code given varies from run to run, anywhere in the range 12 to 17. The core dumps produced do not have usable backtrace information. There is no output on stderr (besides my debug messages). The last message written and flushed by the rank 0 node on stdout is lost. I cannot determine the cause of the problem.

Let me be as explicit as possible:

  OS:       RHEL 6.8
  Compiler: gcc 4.4.7 with -g, no optimization
  MPI (RedHat package): openmpi-1.10-1.10.2-2.el6.x86_64

The startup command is like this:

  mpirun --output-filename junk -mca btl_tcp_if_include lo \
         -n 1 cnsP0 NOSP : -n 3 cnsPn < v8tin/dan

cnsP0 is a master code that reads a control file (specified after the '<' on the command line). The other executables (cnsPn) only send and receive messages and do math; they perform no file I/O. I get the same results with 3 or 4 compute nodes.

Early in startup, another process is started via MPI_Comm_spawn. I suspect this is relevant to the problem, although simple test programs using the same setup complete normally. This process, andmsg, receives status or debug information asynchronously via messages from the other processes and writes it to stderr.

I have tried many versions of the shutdown code, all with the same result. Here is one version (debug writes, which use fwrite() and fflush(), are deleted; comments are modified for clarity):

Application code (cnsP0 and cnsPn):

  /* Everything works OK up to here (stdout and debug output). */
  int rc, ival = 0;

  /* In the next line, NC.dmsgid is the rank # of the andmsg process
   * and NC.commd is the intercommunicator to it.  andmsg counts these
   * shutdown messages, one from each app node. */
  rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid, SHUTDOWN_ANDMSG, NC.commd);

  /* This message confirms that andmsg got 4 SHUTDOWN messages.
   * "is_host(NC.node)" returns 1 if this is the rank 0 node. */
  if (is_host(NC.node)) {
      MPI_Recv(&ival, 1, MPI_INT, NC.dmsgid, CLOSING_ANDMSG, NC.commd,
               MPI_STATUS_IGNORE);
  }

  /* Results are similar with or without this barrier.  Debug lines
   * written on stderr from all nodes after the barrier appear OK. */
  rc = MPI_Barrier(NC.commc);      /* NC.commc is original world comm */

  /* Behavior is the same with or without this extra message exchange,
   * which I added to keep andmsg from terminating before the
   * barrier among the other nodes completes. */
  if (is_host(NC.node)) {
      rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid, SHUTDOWN_ANDMSG, NC.commd);
  }

  /* Behavior is the same with or without this disconnect */
  rc = MPI_Comm_disconnect(&NC.commd);
  rc = MPI_Finalize();
  exit(0);

Spawned process (andmsg) code extract:

  if (num2stop <= 0) {     /* Countdown of shutdown messages received */
      int rc;

      /* This message confirms to the main app that shutdown messages
       * were received from all nodes. */
      rc = MPI_Send(&num2stop, 1, MPI_INT, NC.hostid, CLOSING_ANDMSG,
                    NC.commd);

      /* Receive the extra synch message commented on above */
      rc = MPI_Recv(&sdmsg, 1, MPI_INT, NC.hostid, MPI_ANY_TAG, NC.commd,
                    MPI_STATUS_IGNORE);

      sleep(1);    /* Results are the same with or without this sleep */

      /* Results are the same with or without this disconnect */
      rc = MPI_Comm_disconnect(&NC.commd);
      rc = MPI_Finalize();
      exit(0);
  }

I would much appreciate any suggestions on how to debug this.

From the suggestions at the community help web page, here is more information: the config.log file (bzipped) is attached, and the bzipped output of "ompi_info --all" is attached.
I am not sending information from other nodes or the network configuration: for test purposes, all processes are running on one node, my laptop with an i7 processor. I set the "-mca btl_tcp_if_include lo" parameter earlier when I got an error message about a refused connection (one that my code never asked for in the first place). This got rid of that error message, but the application still fails and dumps core.

Thanks,
George Reeke
config.log.bz2
Description: application/bzip
ompi_info.output.bz2
Description: application/bzip
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users