I’m not sure why this would be happening. These error outputs go through the 
“show_help” functionality, and we specifically target it at stderr:

    /* create an output stream for us */
    OBJ_CONSTRUCT(&lds, opal_output_stream_t);
    lds.lds_want_stderr = true;
    orte_help_output = opal_output_open(&lds);

Jeff: is it possible the opal_output system is ignoring the request and pushing 
it to stdout?
Ralph


> On Sep 5, 2018, at 4:11 AM, emre brookes <broo...@uthscsa.edu> wrote:
> 
> Thanks Gilles,
> 
> My goal is to separate openmpi errors from the stdout of the MPI program 
> itself so that errors can be identified externally (in particular in an 
> external framework running MPI jobs from various developers).
> 
> My not so "well written MPI program" was doing this:
>   MPI_Finalize();
>   exit( errorcode );
> Which I assume you are telling me was bad practice & will replace with
>   MPI_Abort( MPI_COMM_WORLD, errorcode );
>   MPI_Finalize();
>   exit( errorcode );
> I was previously a bit put off of MPI_Abort due to the vagueness of the man 
> page:
>> _Description_
>> This routine makes a "best attempt" to abort all tasks in the group of comm. 
>> This function does not require that the invoking environment take any action 
>> with the error code. However, a UNIX or POSIX environment should handle this 
>> as a return errorcode from the main program or an abort (errorcode). 
> & I didn't really have an MPI issue to "Abort", but had used this for a user 
> input or parameter issue.
> Nevertheless, I accept your best practice recommendation.
> 
> It was not only the originally reported message; other messages also went to 
> stdout.
> I initially used the Ubuntu 16.04 LTS packages ("$ apt install openmpi-bin 
> libopenmpi-dev"), which gave me version 1.10.2,
> but this morning I compiled and tested 2.1.5, with the same behavior, e.g.:
> 
> $ /src/ompi-2.1.5/bin/mpicxx test_using_mpi_abort.cpp
> $ /src/ompi-2.1.5/bin/mpirun -np 2 a.out > stdout
> [domain-name-embargoed:26078] 1 more process has sent help message 
> help-mpi-api.txt / mpi-abort
> [domain-name-embargoed:26078] Set MCA parameter "orte_base_help_aggregate" to 
> 0 to see all help / error messages
> $ cat stdout
> hello from 0
> hello from 1
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode -1.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> $
> 
> Tested 3.1.2, where this has been *somewhat* fixed:
> 
> $ /src/ompi-3.1.2/bin/mpicxx test_using_mpi_abort.cpp
> $ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode -1.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [domain-name-embargoed:19784] 1 more process has sent help message 
> help-mpi-api.txt / mpi-abort
> [domain-name-embargoed:19784] Set MCA parameter "orte_base_help_aggregate" to 
> 0 to see all help / error messages
> $ cat stdout
> hello from 1
> hello from 0
> $
> 
> But the originally reported error still goes to stdout:
> 
> $ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
> $ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus 
> causing
> the job to be terminated. The first process to do so was:
> 
>  Process name: [[22380,1],0]
>  Exit code:    255
> --------------------------------------------------------------------------
> $ cat stdout
> hello from 0
> hello from 1
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> $
> 
> Summary:
> 1.10.2 and 2.1.5 both send most Open MPI-generated messages to stdout.
> 3.1.2 sends at least one type of Open MPI-generated message to stdout.
> I'll continue with my "wrapper" strategy for now, as it seems safest and
> most broadly deployable [e.g. on compute resources where I need to use admin 
> installed versions of MPI],
> but it would be nice for openmpi to ensure all generated messages end up in 
> stderr.
> 
> -Emre
> 
> Gilles Gouaillardet wrote:
>> Open MPI should likely write this message on stderr, I will have a look at 
>> that.
>> 
>> 
>> That being said, and though I have no intention to dodge the question, this 
>> case should not happen.
>> 
>> A well written (MPI) program should either exit(0) or have main() return 0, 
>> so this scenario (e.g. all MPI tasks call MPI_Finalize() and then at least 
>> one MPI task exits with a non-zero error code) should not happen.
>> 
>> 
>> If your program might fail, it should call MPI_Abort() with a non-zero error 
>> code *before* calling MPI_Finalize().
>> 
>> Note that this error can occur if your main() function does not return any 
>> value (i.e. it returns an undefined value, which might be non-zero).
>> 
>> 
>> Cheers,
>> 
>> 
>> Gilles
>> 
>> 
>> On 9/5/2018 6:08 AM, emre brookes wrote:
>>> Background:
>>> ---
>>> Running on ubuntu 16.04 with apt install openmpi-bin libopenmpi-dev
>>> $  mpirun --version
>>> mpirun (Open MPI) 1.10.2
>>> 
>>> I did search thru the docs a bit (ok, maybe I missed something obvious, my 
>>> apologies if so)
>>> ---
>>> Question:
>>> 
>>> Is there some setting to turn off the extra messages generated by openmpi ?
>>> 
>>> e.g.
>>> $ mpirun -np 2 my_job > my_job.stdout
>>> adds this message to my_job.stdout
>>> -------------------------------------------------------
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> which strangely goes to stdout and not stderr.
>>> I would intuitively expect error or notice messages to go to stderr.
>>> Is there a way to redirect these messages to stderr or some specified file?
>>> 
>>> I need to separate this from the collected stdout of the job processes 
>>> themselves.
>>> 
>>> Somewhat kludgy options that come to mind:
>>> 
>>> 1. I can use --output-filename outfile, which does separate the "openmpi" 
>>> messages,
>>> but this creates a file for each process and I'd rather keep them as 
>>> produced in one file,
>>> but without any messages from openmpi, which I'd like to keep separately.
>>> 
>>> 2. Or I could write a script to filter the output and separate. A bit risky 
>>> as someone could conceivably put something that looks like an openmpi 
>>> message pattern in the mpi executable output.
>>> 
>>> 3. hack the source code of openmpi.
>>> 
>>> Any suggestions as to a more elegant or standard way of dealing with this?
>>> 
>>> TIA,
>>> Emre.
>>> 
> 
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
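
For readers weighing the filtering workaround from the original question (option 2), here is a minimal sketch in C of what such a filter might look like. It is not part of Open MPI. The names is_rule_line, split_banners, and MIN_RULE_LEN are hypothetical, and the only assumption about the message format is what the transcripts above show: each banner is enclosed by a pair of lines made up solely of dashes. As noted in the original question, any program output that happens to match this pattern would be misrouted.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Assumption: a "rule" is a line made only of '-' characters and at least
 * MIN_RULE_LEN long, like the banner delimiters in the transcripts above. */
#define MIN_RULE_LEN 20

bool is_rule_line(const char *line)
{
    size_t n = strcspn(line, "\r\n");   /* length without the line ending */
    if (n < MIN_RULE_LEN)
        return false;
    for (size_t i = 0; i < n; i++)
        if (line[i] != '-')
            return false;
    return true;
}

/* Copy `in` line by line: rule lines and everything between a pair of
 * rule lines go to `banner`; all other lines go to `plain`.
 * Assumes input lines are shorter than the 4096-byte buffer. */
void split_banners(FILE *in, FILE *plain, FILE *banner)
{
    char line[4096];
    bool in_banner = false;

    while (fgets(line, sizeof line, in) != NULL) {
        if (is_rule_line(line)) {
            fputs(line, banner);
            in_banner = !in_banner;     /* opening or closing rule */
        } else {
            fputs(line, in_banner ? banner : plain);
        }
    }
}
```

Calling split_banners(stdin, stdout, stderr) from a small main() would turn this into a pipe filter for mpirun's stdout, sending the dashed banners to stderr and everything else to stdout.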
