Looks like there is a place in orte/mca/state/state_base_fns.c:850 that also 
outputs to orte_clean_output instead of using show_help. Outside of those two 
places, everything else seems to go to show_help.


> On Sep 10, 2018, at 8:58 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> It seems I got it wrong :-(
> 
> 
> Can you please give the attached patch a try ?
> 
> 
> FWIW, another option would be to opal_output(orte_help_output, ...) but we 
> would have to make orte_help_output "public" first.
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> 
> 
> 
> On 9/11/2018 11:14 AM, emre brookes wrote:
>> Gilles Gouaillardet wrote:
>>> I investigated this a bit and found that the (latest ?) v3 branches have 
>>> the expected behavior
>>> 
>>> (e.g. the error message is sent to stderr)
>>> 
>>> 
>>> Since it is very unlikely Open MPI 2.1 will ever be updated, I can simply 
>>> encourage you to upgrade to a newer Open MPI version.
>>> 
>>> The latest fully supported versions currently include 3.1.2 and 3.0.2
>>> 
>>> 
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> 
>> So you tested 3.1.2 or something newer with this error?
>> 
>>> But the originally reported error still goes to stdout:
>>> 
>>> $ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
>>> $ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
>>> -------------------------------------------------------------------------- 
>>> mpirun detected that one or more processes exited with non-zero status, 
>>> thus causing
>>> the job to be terminated. The first process to do so was:
>>> 
>>>   Process name: [[22380,1],0]
>>>   Exit code:    255
>>> -------------------------------------------------------------------------- 
>>> $ cat stdout
>>> hello from 0
>>> hello from 1
>>> -------------------------------------------------------
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> $
>> -Emre
>> 
>> 
>> 
>>> 
>>> On 9/11/2018 2:27 AM, Ralph H Castain wrote:
>>>> I’m not sure why this would be happening. These error outputs go through 
>>>> the “show_help” functionality, and we specifically target it at stderr:
>>>> 
>>>>      /* create an output stream for us */
>>>>      OBJ_CONSTRUCT(&lds, opal_output_stream_t);
>>>>      lds.lds_want_stderr = true;
>>>>      orte_help_output = opal_output_open(&lds);
>>>> 
>>>> Jeff: is it possible the opal_output system is ignoring the request and 
>>>> pushing it to stdout??
>>>> Ralph
>>>> 
>>>> 
>>>>> On Sep 5, 2018, at 4:11 AM, emre brookes <broo...@uthscsa.edu> wrote:
>>>>> 
>>>>> Thanks Gilles,
>>>>> 
>>>>> My goal is to separate openmpi errors from the stdout of the MPI program 
>>>>> itself so that errors can be identified externally (in particular in an 
>>>>> external framework running MPI jobs from various developers).
>>>>> 
>>>>> My not so "well written MPI program" was doing this:
>>>>>    MPI_Finalize();
>>>>>    exit( errorcode );
>>>>> Which I assume you are telling me was bad practice & will replace with
>>>>>    MPI_Abort( MPI_COMM_WORLD, errorcode );
>>>>>    MPI_Finalize();
>>>>>    exit( errorcode );
>>>>> I was previously a bit put off of MPI_Abort due to the vagueness of the 
>>>>> man page:
>>>>>> _Description_
>>>>>> This routine makes a "best attempt" to abort all tasks in the group of 
>>>>>> comm. This function does not require that the invoking environment take 
>>>>>> any action with the error code. However, a UNIX or POSIX environment 
>>>>>> should handle this as a return errorcode from the main program or an 
>>>>>> abort (errorcode).
>>>>> & I didn't really have an MPI issue to "Abort", but had used this for a 
>>>>> user input or parameter issue.
>>>>> Nevertheless, I accept your best practice recommendation.
>>>>> 
>>>>> It was not only the originally reported message, other messages went to 
>>>>> stdout.
>>>>> Initially used the Ubuntu 16 LTS  "$ apt install openmpi-bin 
>>>>> libopenmpi-dev" which got me version (1.10.2),
>>>>> but this morning compiled and tested 2.1.5, with the same behavior, e.g.:
>>>>> 
>>>>> $ /src/ompi-2.1.5/bin/mpicxx test_using_mpi_abort.cpp
>>>>> $ /src/ompi-2.1.5/bin/mpirun -np 2 a.out > stdout
>>>>> [domain-name-embargoed:26078] 1 more process has sent help message 
>>>>> help-mpi-api.txt / mpi-abort
>>>>> [domain-name-embargoed:26078] Set MCA parameter 
>>>>> "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>> $ cat stdout
>>>>> hello from 0
>>>>> hello from 1
>>>>> --------------------------------------------------------------------------
>>>>>  
>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>> with errorcode -1.
>>>>> 
>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>> You may or may not see output from other processes, depending on
>>>>> exactly when Open MPI kills them.
>>>>> --------------------------------------------------------------------------
>>>>>  
>>>>> $
>>>>> 
>>>>> Tested 3.1.2, where this has been *somewhat* fixed:
>>>>> 
>>>>> $ /src/ompi-3.1.2/bin/mpicxx test_using_mpi_abort.cpp
>>>>> $ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
>>>>> --------------------------------------------------------------------------
>>>>>  
>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>> with errorcode -1.
>>>>> 
>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>> You may or may not see output from other processes, depending on
>>>>> exactly when Open MPI kills them.
>>>>> --------------------------------------------------------------------------
>>>>>  
>>>>> [domain-name-embargoed:19784] 1 more process has sent help message 
>>>>> help-mpi-api.txt / mpi-abort
>>>>> [domain-name-embargoed:19784] Set MCA parameter 
>>>>> "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>> $ cat stdout
>>>>> hello from 1
>>>>> hello from 0
>>>>> $
>>>>> 
>>>>> But the originally reported error still goes to stdout:
>>>>> 
>>>>> $ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
>>>>> $ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
>>>>> --------------------------------------------------------------------------
>>>>>  
>>>>> mpirun detected that one or more processes exited with non-zero status, 
>>>>> thus causing
>>>>> the job to be terminated. The first process to do so was:
>>>>> 
>>>>>   Process name: [[22380,1],0]
>>>>>   Exit code:    255
>>>>> --------------------------------------------------------------------------
>>>>>  
>>>>> $ cat stdout
>>>>> hello from 0
>>>>> hello from 1
>>>>> -------------------------------------------------------
>>>>> Primary job  terminated normally, but 1 process returned
>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>> $
>>>>> 
>>>>> Summary:
>>>>> 1.10.2, 2.1.5 both send most openmpi generated messages to stdout.
>>>>> 3.1.2 sends at least one type of openmpi generated message to stdout.
>>>>> I'll continue with my "wrapper" strategy for now, as it seems safest and
>>>>> most broadly deployable [e.g. on compute resources where I need to use 
>>>>> admin installed versions of MPI],
>>>>> but it would be nice for openmpi to ensure all generated messages end up 
>>>>> in stderr.
>>>>> 
>>>>> -Emre
>>>>> 
>>>>> Gilles Gouaillardet wrote:
>>>>>> Open MPI should likely write this message on stderr, I will have a look 
>>>>>> at that.
>>>>>> 
>>>>>> 
>>>>>> That being said, and though I have no intention to dodge the question, 
>>>>>> this case should not happen.
>>>>>> 
>>>>>> A well written (MPI) program should either exit(0) or have main() return 
>>>>>> 0, so this scenario
>>>>>> 
>>>>>> (e.g. all MPI tasks call MPI_Finalize() and then at least one MPI task 
>>>>>> exits with a non-zero error code)
>>>>>> 
>>>>>> should not happen.
>>>>>> 
>>>>>> 
>>>>>> If your program might fail, it should call MPI_Abort() with a non zero 
>>>>>> error code *before* calling MPI_Finalize().
>>>>>> 
>>>>>> Note that this error can occur if your main() subroutine does not return 
>>>>>> any value (e.g. it returns an undefined value that might be non-zero)
>>>>>> 
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> 
>>>>>> Gilles
>>>>>> 
>>>>>> 
>>>>>> On 9/5/2018 6:08 AM, emre brookes wrote:
>>>>>>> Background:
>>>>>>> ---
>>>>>>> Running on ubuntu 16.04 with apt install openmpi-bin libopenmpi-dev
>>>>>>> $  mpirun --version
>>>>>>> mpirun (Open MPI) 1.10.2
>>>>>>> 
>>>>>>> I did search thru the docs a bit (ok, maybe I missed something obvious, 
>>>>>>> my apologies if so)
>>>>>>> ---
>>>>>>> Question:
>>>>>>> 
>>>>>>> Is there some setting to turn off the extra messages generated by 
>>>>>>> openmpi ?
>>>>>>> 
>>>>>>> e.g.
>>>>>>> $ mpirun -np 2 my_job > my_job.stdout
>>>>>>> adds this message to my_job.stdout
>>>>>>> -------------------------------------------------------
>>>>>>> Primary job  terminated normally, but 1 process returned
>>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>>> -------------------------------------------------------
>>>>>>> which strangely goes to stdout and not stderr.
>>>>>>> I would intuitively expect error or notice messages to go to stderr.
>>>>>>> Is there a way to redirect these messages to stderr or some specified 
>>>>>>> file?
>>>>>>> 
>>>>>>> I need to separate this from the collected stdout of the job processes 
>>>>>>> themselves.
>>>>>>> 
>>>>>>> Somewhat kludgy options that come to mind:
>>>>>>> 
>>>>>>> 1. I can use --output-filename outfile, which does separate the 
>>>>>>> "openmpi" messages,
>>>>>>> but this creates a file for each process and I'd rather keep them as 
>>>>>>> produced in one file,
>>>>>>> but without any messages from openmpi, which I'd like to keep 
>>>>>>> separately.
>>>>>>> 
>>>>>>> 2. Or I could write a script to filter the output and separate. A bit 
>>>>>>> risky as someone could conceivably put something that looks like an 
>>>>>>> openmpi message pattern in the mpi executable output.
>>>>>>> 
>>>>>>> 3. hack the source code of openmpi.
>>>>>>> 
>>>>>>> Any suggestions as to a more elegant or standard way of dealing with 
>>>>>>> this?
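
For option 2 above, a minimal sketch of such a filter might look like the following. It assumes, based only on the output quoted in this thread, that Open MPI banners are delimited by lines consisting solely of dashes. The names is_banner_rule, split_banners and SplitOutput, and the 50-dash threshold, are hypothetical and not part of Open MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: true if the line consists only of dashes (ignoring
// trailing whitespace) and has at least min_len of them. The threshold of
// 50 is an assumption taken from the banners quoted in this thread.
bool is_banner_rule(const std::string& line, std::size_t min_len = 50) {
    assert(min_len > 0);
    const std::size_t end = line.find_last_not_of(" \t\r");
    if (end == std::string::npos) return false;  // empty or whitespace only
    const std::size_t len = end + 1;             // length without trailing blanks
    if (len < min_len) return false;
    // No non-dash character may appear before the trailing whitespace.
    return line.find_first_not_of('-') >= len;
}

struct SplitOutput {
    std::string program;      // lines assumed to come from the MPI job itself
    std::string diagnostics;  // lines assumed to belong to an Open MPI banner
};

// Splits captured stdout into job output and dash-delimited banner blocks.
// A rule line toggles the "inside a banner" state; the rule lines themselves
// are kept with the diagnostics.
SplitOutput split_banners(const std::string& text) {
    SplitOutput out;
    std::istringstream in(text);
    std::string line;
    bool in_banner = false;
    while (std::getline(in, line)) {
        if (is_banner_rule(line)) {
            out.diagnostics += line + '\n';
            in_banner = !in_banner;
        } else if (in_banner) {
            out.diagnostics += line + '\n';
        } else {
            out.program += line + '\n';
        }
    }
    return out;
}
```

Calling split_banners() on the contents of my_job.stdout would return the job's own lines in .program and the banner blocks in .diagnostics. The risk noted in option 2 still applies: if the job itself prints a long line of dashes, everything up to the next such line is misclassified as a banner, and an unterminated banner swallows the rest of the output.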
>>>>>>> 
>>>>>>> TIA,
>>>>>>> Emre.
>>>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users@lists.open-mpi.org
>>>>> https://lists.open-mpi.org/mailman/listinfo/users
> <default_hnp_abort.diff>
