George,
I tried to mimic this with the latest v1.10, and failed to reproduce
the error.
First, I recommend you try the latest v1.10 (1.10.4) or even 2.0.1.
An unusable stack trace can sometimes be caused by unloaded modules,
so if the issue persists, you might want to try rebuilding Open MPI with
--disable-dlopen
and, since you want to do some debugging, you might also want to add
--enable-debug
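For example, a rebuild could look like this (the install prefix here is
just an illustration; pick whatever suits your system):
./configure --prefix=$HOME/ompi-debug --enable-debug --disable-dlopen
make install
Just make sure the mpicc and mpirun from that prefix come first in your
PATH when you rebuild and rerun your application.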
Then, if the issue still persists, I suggest you prepare a trimmed
version of your program that is just enough to reproduce the issue,
and post it so some of us can have a look.
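For reference, a trimmed reproducer of the spawn/shutdown pattern can be
quite small. Here is a sketch based on my understanding of your
description (the self-spawning binary and the "child" argument are my
own simplifications, not your actual code):

/* Parent ranks spawn one copy of this same binary, exchange a
 * shutdown message, disconnect, and finalize.  Compile with e.g.
 * "mpicc spawn_test.c -o spawn_test" and run with
 * "mpirun -n 2 ./spawn_test". */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Comm inter;
    int rank, ival = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (argc > 1 && strcmp(argv[1], "child") == 0) {
        /* Spawned side: wait for the shutdown message from parent rank 0 */
        MPI_Comm_get_parent(&inter);
        MPI_Recv(&ival, 1, MPI_INT, 0, 0, inter, MPI_STATUS_IGNORE);
        MPI_Comm_disconnect(&inter);
    } else {
        /* Parent side: collectively spawn one child of this same binary */
        char *child_argv[] = { "child", NULL };
        MPI_Comm_spawn(argv[0], child_argv, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        if (rank == 0)
            MPI_Send(&ival, 1, MPI_INT, 0, 0, inter);
        MPI_Comm_disconnect(&inter);
    }
    MPI_Finalize();
    return 0;
}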
Cheers,
Gilles
On 10/7/2016 2:53 AM, George Reeke wrote:
Dear colleagues,
I have a parallel MPI application written in C that works normally in
a serial version and in the parallel version in the sense that all
numerical output is correct. When it tries to shut down, it gives the
following console error message:
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:
Process name: [[51524,1],0]
Exit code: 13
-----End quoted console text-----
The Process name given is not the number of any Linux process.
The Exit code given seems to be any number in the range 12 to 17.
The core dumps produced do not have usable backtrace information.
There is no output on stderr (besides my debug messages).
The last message written to stdout by the rank 0 node, even though flushed, is lost.
I cannot determine the cause of the problem.
Let me be as explicit as possible:
OS RHEL 6.8, compiler gcc 4.4.7 with -g, no optimization
Version of MPI (RedHat package): openmpi-1.10-1.10.2-2.el6.x86_64
The startup command is like this:
mpirun --output-filename junk -mca btl_tcp_if_include lo -n 1 cnsP0 NOSP : -n 3 cnsPn < v8tin/dan
cnsP0 is the master executable; it reads a control file (specified
after the '<' on the command line). The other executables (cnsPn) only
send and receive messages and do math, no file I/O. I get the same
results with 3 or 4 compute nodes.
Early in startup, another process is started via MPI_Comm_spawn.
I suspect this is relevant to the problem, although simple test
programs using the same setup complete normally. This process,
andmsg, receives status or debug information asynchronously via
messages from the other processes and writes them to stderr.
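Schematically, the spawn looks like this (simplified and reconstructed
for this post; not the exact call from my code):

/* NC.commd becomes the intercommunicator used for all status traffic */
rc = MPI_Comm_spawn("andmsg", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                    NC.commc, &NC.commd, MPI_ERRCODES_IGNORE);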
I have tried many versions of the shutdown code, all with the same
result. Here is one version (debug writes, using fwrite() and
fflush(), are deleted; comments modified for clarity):
Application code (cnsP0 and cnsPn):
/* Everything works OK up to here (stdout and debug output). */
int rc, ival = 0;
/* In next line, NC.dmsgid is rank # of andmsg process and
* NC.commd is intercommunicator to it. andmsg counts these
* shutdown messages, one from each app node. */
rc = MPI_Send(&ival, 1, MPI_INT, NC.dmsgid, SHUTDOWN_ANDMSG,
NC.commd);
/* This message confirms that andmsg got 4 SHUTDOWN messages.
* "is_host(NC.node)" returns 1 if this is the rank 0 node. */
if (is_host(NC.node)) { MPI_Recv(&ival, 1, MPI_INT, NC.dmsgid,
CLOSING_ANDMSG, NC.commd, MPI_STATUS_IGNORE); }
/* Results are similar with or without this barrier. Debug lines
* written on stderr from all nodes after barrier appear OK. */
rc = MPI_Barrier(NC.commc); /* NC.commc is original world comm */
/* Behavior is same with or without this extra message exchange,
* which I added to keep andmsg from terminating before the
* barrier among the other nodes completes. */
if (is_host(NC.node)) { rc = MPI_Send(&ival, 1, MPI_INT,
NC.dmsgid, SHUTDOWN_ANDMSG, NC.commd); }
/* Behavior is same with or without this disconnect */
rc = MPI_Comm_disconnect(&NC.commd);
rc = MPI_Finalize();
exit(0);
Spawned process (andmsg) code extract:
if (num2stop <= 0) { /* Countdown of shutdown messages received */
int rc;
/* This message confirms to main app that shutdown messages
* were received from all nodes. */
rc = MPI_Send(&num2stop, 1, MPI_INT, NC.hostid,
CLOSING_ANDMSG, NC.commd);
/* Receive extra synch message commented above */
rc = MPI_Recv(&sdmsg, 1, MPI_INT, NC.hostid, MPI_ANY_TAG,
NC.commd, MPI_STATUS_IGNORE);
sleep(1); /* Results are same with or without this sleep */
/* Results are same with or without this disconnect */
rc = MPI_Comm_disconnect(&NC.commd);
rc = MPI_Finalize();
exit(0);
}
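One debugging aid (a sketch, not code from the application) would be to
set MPI_ERRORS_RETURN on the communicators so that the stored rc values
can actually be checked instead of being ignored:

#include <stdio.h>

static void check_rc(int rc, const char *where)
{
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "%s: %s\n", where, msg);
    }
}

/* once, right after MPI_Init(): */
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

/* then at each call site, e.g.: */
check_rc(MPI_Barrier(NC.commc), "MPI_Barrier");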
I would much appreciate any suggestions on how to debug this.
From the suggestions at the community help web page, here is more
information:
The config.log file (bzipped) is attached.
The bzipped output of "ompi_info --all" is attached.
I am not sending information from other nodes or the network config;
for test purposes, all processes are running on one node, my laptop
with an i7 processor. I set the "-mca btl_tcp_if_include lo" parameter
earlier when I got an error message about a refused connection (which
my code never asked for in the first place). That got rid of the error
message, but the application still fails and dumps core.
Thanks,
George Reeke
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users