I’m not sure why this would be happening. These error outputs go through the “show_help” functionality, and we specifically target it at stderr:
    /* create an output stream for us */
    OBJ_CONSTRUCT(&lds, opal_output_stream_t);
    lds.lds_want_stderr = true;
    orte_help_output = opal_output_open(&lds);

Jeff: is it possible the opal_output system is ignoring the request and pushing it to stdout??
Ralph

> On Sep 5, 2018, at 4:11 AM, emre brookes <broo...@uthscsa.edu> wrote:
>
> Thanks Gilles,
>
> My goal is to separate openmpi errors from the stdout of the MPI program
> itself so that errors can be identified externally (in particular in an
> external framework running MPI jobs from various developers).
>
> My not so "well written MPI program" was doing this:
> MPI_Finalize();
> exit( errorcode );
> Which I assume you are telling me was bad practice & will replace with
> MPI_Abort( MPI_COMM_WORLD, errorcode );
> MPI_Finalize();
> exit( errorcode );
> I was previously a bit put off of MPI_Abort due to the vagueness of the man
> page:
>> _Description_
>> This routine makes a "best attempt" to abort all tasks in the group of comm.
>> This function does not require that the invoking environment take any action
>> with the error code. However, a UNIX or POSIX environment should handle this
>> as a return errorcode from the main program or an abort (errorcode).
> & I didn't really have an MPI issue to "Abort", but had used this for a user
> input or parameter issue.
> Nevertheless, I accept your best practice recommendation.
>
> It was not only the originally reported message, other messages went to
> stdout.
> Initially used the Ubuntu 16 LTS "$ apt install openmpi-bin libopenmpi-dev"
> which got me version (1.10.2),
> but this morning compiled and tested 2.1.5, with the same behavior, e.g.:
>
> $ /src/ompi-2.1.5/bin/mpicxx test_using_mpi_abort.cpp
> $ /src/ompi-2.1.5/bin/mpirun -np 2 a.out > stdout
> [domain-name-embargoed:26078] 1 more process has sent help message
> help-mpi-api.txt / mpi-abort
> [domain-name-embargoed:26078] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
> $ cat stdout
> hello from 0
> hello from 1
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode -1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> $
>
> Tested 3.1.2, where this has been *somewhat* fixed:
>
> $ /src/ompi-3.1.2/bin/mpicxx test_using_mpi_abort.cpp
> $ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode -1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [domain-name-embargoed:19784] 1 more process has sent help message
> help-mpi-api.txt / mpi-abort
> [domain-name-embargoed:19784] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
> $ cat stdout
> hello from 1
> hello from 0
> $
>
> But the originally reported error still goes to stdout:
>
> $ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
> $ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
>
> Process name: [[22380,1],0]
> Exit code: 255
> --------------------------------------------------------------------------
> $ cat stdout
> hello from 0
> hello from 1
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> $
>
> Summary:
> 1.10.2, 2.1.5 both send most openmpi generated messages to stdout.
> 3.1.2 sends at least one type of openmpi generated messages to stdout.
> I'll continue with my "wrapper" strategy for now, as it seems safest and
> most broadly deployable [e.g. on compute resources where I need to use admin
> installed versions of MPI],
> but it would be nice for openmpi to ensure all generated messages end up in
> stderr.
>
> -Emre
>
> Gilles Gouaillardet wrote:
>> Open MPI should likely write this message on stderr, I will have a look at
>> that.
>>
>>
>> That being said, and though I have no intention to dodge the question, this
>> case should not happen.
>>
>> A well written (MPI) program should either exit(0) or have main() return 0,
>> so this scenario
>>
>> (e.g.
>> all MPI tasks call MPI_Finalize() and then at least one MPI task exit
>> with a non zero error code)
>>
>> should not happen.
>>
>>
>> If your program might fail, it should call MPI_Abort() with a non zero error
>> code *before* calling MPI_Finalize().
>>
>> note this error can occur if your main() subroutine does not return any
>> value (e.g. it returns an undefined value, that might be non zero)
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>> On 9/5/2018 6:08 AM, emre brookes wrote:
>>> Background:
>>> ---
>>> Running on ubuntu 16.04 with apt install openmpi-bin libopenmpi-dev
>>> $ mpirun --version
>>> mpirun (Open MPI) 1.10.2
>>>
>>> I did search thru the docs a bit (ok, maybe I missed something obvious, my
>>> apologies if so)
>>> ---
>>> Question:
>>>
>>> Is there some setting to turn off the extra messages generated by openmpi ?
>>>
>>> e.g.
>>> $ mpirun -np 2 my_job > my_job.stdout
>>> adds this message to my_job.stdout
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> which strangely goes to stdout and not stderr.
>>> I would intuitively expect error or notice messages to go to stderr.
>>> Is there a way to redirect these messages to stderr or some specified file?
>>>
>>> I need to separate this from the collected stdout of the job processes
>>> themselves.
>>>
>>> Somewhat kludgy options that come to mind:
>>>
>>> 1. I can use --output-filename outfile, which does separate the "openmpi"
>>> messages,
>>> but this creates a file for each process and I'd rather keep them as
>>> produced in one file,
>>> but without any messages from openmpi, which I'd like to keep separately.
>>>
>>> 2. Or I could write a script to filter the output and separate.
>>> A bit risky
>>> as someone could conceivably put something that looks like a openmpi
>>> message pattern in the mpi executable output.
>>>
>>> 3. hack the source code of openmpi.
>>>
>>> Any suggestions as to a more elegant or standard way of dealing with this?
>>>
>>> TIA,
>>> Emre.
>>>

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
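
For an external framework that launches the job itself, the separation discussed in this thread comes down to giving the child process two different descriptors for stdout and stderr. The following is only a minimal POSIX sketch in C of that mechanism, not anything taken from Open MPI: `run_split`, `struct captured` and the 4096-byte `CAP` are hypothetical names and limits chosen for illustration. It can only separate what the launched command writes to each descriptor, so any message that the launcher sends to stdout (as reported above for 1.10.2 and 2.1.5) would still end up in the stdout buffer.

```c
#include <poll.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CAP 4096  /* per-stream capture limit for this sketch (assumption) */

/* Hypothetical names; nothing here is part of Open MPI. */
struct captured {
    char out[CAP];  /* bytes the child wrote to stdout */
    char err[CAP];  /* bytes the child wrote to stderr */
    int exit_code;  /* child's exit status, -1 if it did not exit normally */
};

/* Run argv[0] (looked up on PATH) and capture its two streams separately.
 * Returns 0 if the child was started, -1 otherwise. */
int run_split(char *const argv[], struct captured *c)
{
    int po[2], pe[2];
    memset(c, 0, sizeof *c);
    c->exit_code = -1;
    if (pipe(po) != 0) return -1;
    if (pipe(pe) != 0) { close(po[0]); close(po[1]); return -1; }

    pid_t pid = fork();
    if (pid < 0) {
        close(po[0]); close(po[1]); close(pe[0]); close(pe[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: point fd 1 and fd 2 at different pipes, then exec. */
        dup2(po[1], STDOUT_FILENO);
        dup2(pe[1], STDERR_FILENO);
        close(po[0]); close(po[1]); close(pe[0]); close(pe[1]);
        execvp(argv[0], argv);
        _exit(127);  /* exec failed */
    }
    close(po[1]);
    close(pe[1]);

    /* Parent: poll both pipes so neither can fill up and stall the child. */
    struct pollfd fds[2] = { { po[0], POLLIN, 0 }, { pe[0], POLLIN, 0 } };
    char *dst[2] = { c->out, c->err };
    size_t used[2] = { 0, 0 };
    int open_fds = 2;
    while (open_fds > 0 && poll(fds, 2, -1) >= 0) {
        for (int i = 0; i < 2; i++) {
            if (fds[i].fd < 0 || fds[i].revents == 0) continue;
            char tmp[512];
            ssize_t n = read(fds[i].fd, tmp, sizeof tmp);
            if (n <= 0) {  /* EOF or error: stop watching this pipe */
                close(fds[i].fd);
                fds[i].fd = -1;
                open_fds--;
                continue;
            }
            /* Keep what fits (leaving room for the NUL), drop the rest. */
            size_t room = CAP - 1 - used[i];
            size_t keep = (size_t)n < room ? (size_t)n : room;
            memcpy(dst[i] + used[i], tmp, keep);
            used[i] += keep;
        }
    }
    for (int i = 0; i < 2; i++)
        if (fds[i].fd >= 0) close(fds[i].fd);

    int status;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        c->exit_code = WEXITSTATUS(status);
    return 0;
}
```

Called with the `mpirun -np 2 a.out` command line from the thread as `argv`, `c.out` would hold everything that arrived on stdout, including any Open MPI messages sent there, while `c.err` would hold only what reached stderr. Polling both pipes rather than draining one after the other is a deliberate choice: a child that writes a lot to the stream not currently being read could otherwise block.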