Gilles: Can you submit a PR to fix these 2 places?

Thanks!

> On Sep 11, 2018, at 9:10 AM, emre brookes <broo...@uthscsa.edu> wrote:
> 
> Gilles Gouaillardet wrote:
>> It seems I got it wrong :-(
> Ah, you've joined the rest of us :)
>> 
>> Can you please give the attached patch a try ?
>> 
> Working with a git clone of 3.1.x, patch applied
> 
> $ /src/ompi-3.1.x/bin/mpicxx test.cpp
> $ /src/ompi-3.1.x/bin/mpirun a.out > stdout
> --------------------------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
> 
> Process name: [[2667,1],2]
> Exit code:    255
> --------------------------------------------------------------------------
> $ cat stdout
> hello from 1
> hello from 2
> hello from 3
> hello from 5
> hello from 0
> hello from 4
> $
> 
> Works correctly for this error message.
> 
> Thanks,
> -Emre
> 
>> 
>> FWIW, another option would be to opal_output(orte_help_output, ...), but we 
>> would have to make orte_help_output "public" first.
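>> Something along these lines (a sketch only; it assumes orte_help_output is 
>> exported, which it currently is not):
>> 
>>     /* hypothetical once orte_help_output is public: route the text through
>>        the same stream that show_help already points at stderr */
>>     opal_output(orte_help_output,
>>                 "Primary job terminated normally, but %d process(es) "
>>                 "returned a non-zero exit code.", num_failed /* hypothetical */);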
>> 
>> 
>> Cheers,
>> 
>> 
>> Gilles
>> 
>> 
>> 
>> 
>> On 9/11/2018 11:14 AM, emre brookes wrote:
>>> Gilles Gouaillardet wrote:
>>>> I investigated this a bit and found that the (latest?) v3 branches have 
>>>> the expected behavior
>>>> 
>>>> (e.g. the error message is sent to stderr)
>>>> 
>>>> 
>>>> Since it is very unlikely Open MPI 2.1 will ever be updated, I can simply 
>>>> encourage you to upgrade to a newer Open MPI version.
>>>> 
>>>> The latest fully supported versions are currently 3.1.2 and 3.0.2.
>>>> 
>>>> 
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> 
>>> So you tested 3.1.2 or something newer with this error?
>>> 
>>>> But the originally reported error still goes to stdout:
>>>> 
>>>> $ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
>>>> $ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
>>>> -------------------------------------------------------------------------- 
>>>> mpirun detected that one or more processes exited with non-zero status, thus causing
>>>> the job to be terminated. The first process to do so was:
>>>> 
>>>>  Process name: [[22380,1],0]
>>>>  Exit code:    255
>>>> -------------------------------------------------------------------------- 
>>>> $ cat stdout
>>>> hello from 0
>>>> hello from 1
>>>> -------------------------------------------------------
>>>> Primary job  terminated normally, but 1 process returned
>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>> $
>>> -Emre
>>> 
>>> 
>>> 
>>>> 
>>>> On 9/11/2018 2:27 AM, Ralph H Castain wrote:
>>>>> I’m not sure why this would be happening. These error outputs go through 
>>>>> the “show_help” functionality, and we specifically target it at stderr:
>>>>> 
>>>>>     /* create an output stream for us */
>>>>>     OBJ_CONSTRUCT(&lds, opal_output_stream_t);
>>>>>     lds.lds_want_stderr = true;
>>>>>     orte_help_output = opal_output_open(&lds);
>>>>> 
>>>>> Jeff: is it possible the opal_output system is ignoring the request and 
>>>>> pushing it to stdout??
>>>>> Ralph
>>>>> 
>>>>> 
>>>>>> On Sep 5, 2018, at 4:11 AM, emre brookes <broo...@uthscsa.edu> wrote:
>>>>>> 
>>>>>> Thanks Gilles,
>>>>>> 
>>>>>> My goal is to separate openmpi errors from the stdout of the MPI program 
>>>>>> itself so that errors can be identified externally (in particular in an 
>>>>>> external framework running MPI jobs from various developers).
>>>>>> 
>>>>>> My not-so-"well-written" MPI program was doing this:
>>>>>>   MPI_Finalize();
>>>>>>   exit( errorcode );
>>>>>> which I assume you are telling me was bad practice, and which I will replace with:
>>>>>>   MPI_Abort( MPI_COMM_WORLD, errorcode );
>>>>>>   MPI_Finalize();
>>>>>>   exit( errorcode );
>>>>>> I was previously a bit put off from MPI_Abort due to the vagueness of its 
>>>>>> man page:
>>>>>>> _Description_
>>>>>>> This routine makes a "best attempt" to abort all tasks in the group of 
>>>>>>> comm. This function does not require that the invoking environment take 
>>>>>>> any action with the error code. However, a UNIX or POSIX environment 
>>>>>>> should handle this as a return errorcode from the main program or an 
>>>>>>> abort (errorcode).
>>>>>> & I didn't really have an MPI issue to "Abort", but had used this for a 
>>>>>> user input or parameter issue.
>>>>>> Nevertheless, I accept your best practice recommendation.
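>>>>>> 
>>>>>> For reference, a minimal abort-style test along those lines (a sketch of 
>>>>>> the kind of program I'm compiling with mpicxx; the actual test file may 
>>>>>> differ slightly):
>>>>>> 
>>>>>>    #include <mpi.h>
>>>>>>    #include <cstdio>
>>>>>> 
>>>>>>    int main( int argc, char **argv ) {
>>>>>>       MPI_Init( &argc, &argv );
>>>>>>       int rank;
>>>>>>       MPI_Comm_rank( MPI_COMM_WORLD, &rank );
>>>>>>       printf( "hello from %d\n", rank );
>>>>>>       if ( rank == 0 ) {
>>>>>>          // abort with a non-zero code *before* MPI_Finalize()
>>>>>>          // (MPI_Abort() does not return)
>>>>>>          MPI_Abort( MPI_COMM_WORLD, -1 );
>>>>>>       }
>>>>>>       MPI_Finalize();
>>>>>>       return 0;
>>>>>>    }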
>>>>>> 
>>>>>> It was not only the originally reported message; other messages went to 
>>>>>> stdout as well.
>>>>>> I initially used Ubuntu 16.04 LTS's "$ apt install openmpi-bin 
>>>>>> libopenmpi-dev", which got me version 1.10.2,
>>>>>> but this morning I compiled and tested 2.1.5, with the same behavior, e.g.:
>>>>>> 
>>>>>> $ /src/ompi-2.1.5/bin/mpicxx test_using_mpi_abort.cpp
>>>>>> $ /src/ompi-2.1.5/bin/mpirun -np 2 a.out > stdout
>>>>>> [domain-name-embargoed:26078] 1 more process has sent help message 
>>>>>> help-mpi-api.txt / mpi-abort
>>>>>> [domain-name-embargoed:26078] Set MCA parameter 
>>>>>> "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>> $ cat stdout
>>>>>> hello from 0
>>>>>> hello from 1
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>> with errorcode -1.
>>>>>> 
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>> --------------------------------------------------------------------------
>>>>>> $
>>>>>> 
>>>>>> Tested 3.1.2, where this has been *somewhat* fixed:
>>>>>> 
>>>>>> $ /src/ompi-3.1.2/bin/mpicxx test_using_mpi_abort.cpp
>>>>>> $ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>> with errorcode -1.
>>>>>> 
>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>> You may or may not see output from other processes, depending on
>>>>>> exactly when Open MPI kills them.
>>>>>> --------------------------------------------------------------------------
>>>>>> [domain-name-embargoed:19784] 1 more process has sent help message 
>>>>>> help-mpi-api.txt / mpi-abort
>>>>>> [domain-name-embargoed:19784] Set MCA parameter 
>>>>>> "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>> $ cat stdout
>>>>>> hello from 1
>>>>>> hello from 0
>>>>>> $
>>>>>> 
>>>>>> But the originally reported error still goes to stdout:
>>>>>> 
>>>>>> $ /src/ompi-3.1.2/bin/mpicxx test_without_mpi_abort.cpp
>>>>>> $ /src/ompi-3.1.2/bin/mpirun -np 2 a.out > stdout
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>> the job to be terminated. The first process to do so was:
>>>>>> 
>>>>>>  Process name: [[22380,1],0]
>>>>>>  Exit code:    255
>>>>>> --------------------------------------------------------------------------
>>>>>> $ cat stdout
>>>>>> hello from 0
>>>>>> hello from 1
>>>>>> -------------------------------------------------------
>>>>>> Primary job  terminated normally, but 1 process returned
>>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>>> -------------------------------------------------------
>>>>>> $
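>>>>>> 
>>>>>> (For the record, the without-abort variant is the same sketch as above, 
>>>>>> except that the rank-0 MPI_Abort branch is dropped and the program ends 
>>>>>> with a non-zero exit after MPI_Finalize; this is my reconstruction, and 
>>>>>> the actual test file may differ:
>>>>>> 
>>>>>>    MPI_Finalize();
>>>>>>    if ( rank == 0 ) {
>>>>>>       exit( -1 );   // needs <cstdlib>; non-zero exit *after* MPI_Finalize
>>>>>>    }
>>>>>>    return 0;
>>>>>> 
>>>>>> which is exactly the pattern that still puts the "Primary job terminated 
>>>>>> normally ..." text on stdout here.)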
>>>>>> 
>>>>>> Summary:
>>>>>> 1.10.2 and 2.1.5 both send most Open MPI-generated messages to stdout.
>>>>>> 3.1.2 sends at least one type of Open MPI-generated message to stdout.
>>>>>> I'll continue with my "wrapper" strategy for now, as it seems safest and 
>>>>>> most broadly deployable [e.g. on compute resources where I need to use 
>>>>>> admin-installed versions of MPI],
>>>>>> but it would be nice for Open MPI to ensure all generated messages end up 
>>>>>> on stderr.
>>>>>> 
>>>>>> -Emre
>>>>>> 
>>>>>> Gilles Gouaillardet wrote:
>>>>>>> Open MPI should likely write this message to stderr; I will have a look 
>>>>>>> at that.
>>>>>>> 
>>>>>>> 
>>>>>>> That being said, and though I have no intention to dodge the question, 
>>>>>>> this case should not happen.
>>>>>>> 
>>>>>>> A well-written (MPI) program should either exit(0) or have main() 
>>>>>>> return 0, so this scenario
>>>>>>> 
>>>>>>> (e.g. all MPI tasks call MPI_Finalize() and then at least one MPI task 
>>>>>>> exits with a non-zero error code)
>>>>>>> 
>>>>>>> should not happen.
>>>>>>> 
>>>>>>> 
>>>>>>> If your program might fail, it should call MPI_Abort() with a non-zero 
>>>>>>> error code *before* calling MPI_Finalize().
>>>>>>> 
>>>>>>> Note this error can also occur if your main() function does not return a 
>>>>>>> value (i.e. it returns an undefined value, which might be non-zero).
>>>>>>> 
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> 
>>>>>>> 
>>>>>>> Gilles
>>>>>>> 
>>>>>>> 
>>>>>>> On 9/5/2018 6:08 AM, emre brookes wrote:
>>>>>>>> Background:
>>>>>>>> ---
>>>>>>>> Running on ubuntu 16.04 with apt install openmpi-bin libopenmpi-dev
>>>>>>>> $  mpirun --version
>>>>>>>> mpirun (Open MPI) 1.10.2
>>>>>>>> 
>>>>>>>> I did search thru the docs a bit (ok, maybe I missed something 
>>>>>>>> obvious, my apologies if so)
>>>>>>>> ---
>>>>>>>> Question:
>>>>>>>> 
>>>>>>>> Is there some setting to turn off the extra messages generated by 
>>>>>>>> Open MPI?
>>>>>>>> 
>>>>>>>> e.g.
>>>>>>>> $ mpirun -np 2 my_job > my_job.stdout
>>>>>>>> adds this message to my_job.stdout
>>>>>>>> -------------------------------------------------------
>>>>>>>> Primary job  terminated normally, but 1 process returned
>>>>>>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>>>>>>> -------------------------------------------------------
>>>>>>>> which strangely goes to stdout and not stderr.
>>>>>>>> I would intuitively expect error or notice messages to go to stderr.
>>>>>>>> Is there a way to redirect these messages to stderr or some specified 
>>>>>>>> file?
>>>>>>>> 
>>>>>>>> I need to separate this from the collected stdout of the job processes 
>>>>>>>> themselves.
>>>>>>>> 
>>>>>>>> Somewhat kludgy options that come to mind:
>>>>>>>> 
>>>>>>>> 1. I can use --output-filename outfile, which does separate the 
>>>>>>>> "openmpi" messages, but it creates a file for each process; I'd rather 
>>>>>>>> keep the job output in a single file, as produced, with the Open MPI 
>>>>>>>> messages kept separately.
>>>>>>>> 
>>>>>>>> 2. Or I could write a script to filter the output and separate the two. 
>>>>>>>> A bit risky, as someone could conceivably put something that looks like 
>>>>>>>> an Open MPI message pattern in the MPI executable's output.
>>>>>>>> 
>>>>>>>> 3. Hack the source code of Open MPI.
>>>>>>>> 
>>>>>>>> Any suggestions as to a more elegant or standard way of dealing with 
>>>>>>>> this?
>>>>>>>> 
>>>>>>>> TIA,
>>>>>>>> Emre.
>>>>>>>> 


-- 
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
