I see. Then you understand correctly - we are not going to fix the v1.10 series.
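If it helps to quantify the intermittent hang while you decide on an upgrade path, one rough approach is to run the reproducer repeatedly under a time limit and count the runs that have to be killed. This is only a sketch: the binary name `./aa`, the trial count, and the 10-second limit are placeholders, and it assumes `timeout(1)` from GNU coreutils is available.

```shell
# Count runs of a command that fail to exit within a time limit.
# timeout(1) exits with status 124 when it had to kill the command.
count_hangs() {
    cmd=$1; runs=$2; limit=$3
    hangs=0
    i=1
    while [ "$i" -le "$runs" ]; do
        timeout "$limit" sh -c "$cmd" >/dev/null 2>&1
        [ $? -eq 124 ] && hangs=$((hangs + 1))
        i=$((i + 1))
    done
    echo "$hangs of $runs runs failed to exit within ${limit}s"
}

# Placeholder invocation: 20 trials of the 2-process reproducer.
count_hangs "mpirun -n 2 ./aa" 20 10
```

A hang rate of roughly 25-30% over a few dozen trials would match what was reported below for the v1.10 series.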

> On Nov 8, 2017, at 10:47 AM, Nikolas Antolin <nanto...@gmail.com> wrote:
> 
> That was not my interpretation. His message said he did not observe the race 
> condition for 2 processes, but did for 6 processes. I observe a failure to 
> exit mpirun around 25-30% of the time with 2 processes, causing an 
> inconsistent hang in both my example program and my larger application.
> 
> -Nik
> 
> On Nov 8, 2017 11:40, "r...@open-mpi.org" <r...@open-mpi.org> wrote:
> According to the other reporter, it has been fixed in 1.10.7. I haven’t 
> verified that, but I’d suggest trying it first.
> 
> 
>> On Nov 8, 2017, at 8:26 AM, Nikolas Antolin <nanto...@gmail.com> wrote:
>> 
>> Thank you for the replies. Do I understand correctly that since OpenMPI 
>> v1.10 is no longer supported, I am unlikely to see a bug fix for this 
>> without moving to v2.x or v3.x? I am dealing with clusters where the 
>> administrators may be loath to update packages until it is absolutely 
>> necessary, and want to present them with a complete outlook on the problem.
>> 
>> Thanks,
>> Nik
>> 
>> 2017-11-07 19:00 GMT-07:00 r...@open-mpi.org <r...@open-mpi.org>:
>> Glad to hear it has already been fixed :-)
>> 
>> Thanks!
>> 
>>> On Nov 7, 2017, at 4:13 PM, Tru Huynh <t...@pasteur.fr> wrote:
>>> 
>>> Hi,
>>> 
>>> On Tue, Nov 07, 2017 at 02:05:20PM -0700, Nikolas Antolin wrote:
>>>> Hello,
>>>> 
>>>> In debugging a test of an application, I recently came across odd behavior
>>>> for simultaneous MPI_Abort calls. Namely, while the MPI_Abort was
>>>> acknowledged by the process output, the mpirun process failed to exit. I
>>>> was able to duplicate this behavior on multiple machines with OpenMPI
>>>> versions 1.10.2, 1.10.5, and 1.10.6 with the following simple program:
>>>> 
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>> #include <unistd.h>
>>>> #include <stdbool.h>
>>>> 
>>>> int main(int argc, char **argv)
>>>> {
>>>>    int rank;
>>>> 
>>>>    MPI_Init(&argc,&argv);
>>>>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>> 
>>>>    printf("I am process number %d\n", rank);
>>>>    MPI_Abort(MPI_COMM_WORLD, 3);
>>>>    return 0;
>>>> }
>>>> 
>>>> Is this a bug or a feature? Does this behavior exist in OpenMPI versions
>>>> 2.0 and 3.0?
>>> I compiled your test case on CentOS-7 with openmpi 1.10.7/2.1.2 and
>>> 3.0.0 and the program seems to run fine.
>>> 
>>> [tru@borma openmpi-test-abort]$ for i in 1.10.7 2.1.2 3.0.0; do module purge && module add openmpi/$i && mpicc aa.c -o aa-$i && ldd aa-$i; mpirun -n 2 ./aa-$i; done
>>>     linux-vdso.so.1 =>  (0x00007ffe115bd000)
>>>     libmpi.so.12 => /c7/shared/openmpi/1.10.7/lib/libmpi.so.12 
>>> (0x00007f40d7b4a000)
>>>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f40d78f7000)
>>>     libc.so.6 => /lib64/libc.so.6 (0x00007f40d7534000)
>>>     libopen-rte.so.12 => /c7/shared/openmpi/1.10.7/lib/libopen-rte.so.12 
>>> (0x00007f40d72b8000)
>>>     libopen-pal.so.13 => /c7/shared/openmpi/1.10.7/lib/libopen-pal.so.13 
>>> (0x00007f40d6fd9000)
>>>     libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f40d6dcd000)
>>>     libdl.so.2 => /lib64/libdl.so.2 (0x00007f40d6bc9000)
>>>     librt.so.1 => /lib64/librt.so.1 (0x00007f40d69c0000)
>>>     libm.so.6 => /lib64/libm.so.6 (0x00007f40d66be000)
>>>     libutil.so.1 => /lib64/libutil.so.1 (0x00007f40d64bb000)
>>>     /lib64/ld-linux-x86-64.so.2 (0x000055f6d96c4000)
>>>     libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f40d62a4000)
>>> I am process number 1
>>> I am process number 0
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
>>> with errorcode 3.
>>> 
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> [borma.bis.pasteur.fr:08511] 1 more process has sent help message help-mpi-api.txt / mpi-abort
>>> [borma.bis.pasteur.fr:08511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>     linux-vdso.so.1 =>  (0x00007fffaabcd000)
>>>     libmpi.so.20 => /c7/shared/openmpi/2.1.2/lib/libmpi.so.20 
>>> (0x00007f5bcee39000)
>>>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5bcebe6000)
>>>     libc.so.6 => /lib64/libc.so.6 (0x00007f5bce823000)
>>>     libopen-rte.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-rte.so.20 
>>> (0x00007f5bce5a0000)
>>>     libopen-pal.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-pal.so.20 
>>> (0x00007f5bce2a7000)
>>>     libdl.so.2 => /lib64/libdl.so.2 (0x00007f5bce0a3000)
>>>     libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f5bcde97000)
>>>     libudev.so.1 => /lib64/libudev.so.1 (0x00007f5bcde81000)
>>>     librt.so.1 => /lib64/librt.so.1 (0x00007f5bcdc79000)
>>>     libm.so.6 => /lib64/libm.so.6 (0x00007f5bcd977000)
>>>     libutil.so.1 => /lib64/libutil.so.1 (0x00007f5bcd773000)
>>>     /lib64/ld-linux-x86-64.so.2 (0x000055718df01000)
>>>     libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5bcd55d000)
>>>     libcap.so.2 => /lib64/libcap.so.2 (0x00007f5bcd357000)
>>>     libdw.so.1 => /lib64/libdw.so.1 (0x00007f5bcd110000)
>>>     libattr.so.1 => /lib64/libattr.so.1 (0x00007f5bccf0b000)
>>>     libelf.so.1 => /lib64/libelf.so.1 (0x00007f5bcccf2000)
>>>     libz.so.1 => /lib64/libz.so.1 (0x00007f5bccadc000)
>>>     liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f5bcc8b6000)
>>>     libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f5bcc6a5000)
>>> I am process number 1
>>> I am process number 0
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
>>> with errorcode 3.
>>> 
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> [borma.bis.pasteur.fr:08534] 1 more process has sent help message help-mpi-api.txt / mpi-abort
>>> [borma.bis.pasteur.fr:08534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>     linux-vdso.so.1 =>  (0x00007ffc09585000)
>>>     libmpi.so.40 => /c7/shared/openmpi/3.0.0/lib/libmpi.so.40 
>>> (0x00007fa208ffc000)
>>>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa208da9000)
>>>     libc.so.6 => /lib64/libc.so.6 (0x00007fa2089e6000)
>>>     libopen-rte.so.40 => /c7/shared/openmpi/3.0.0/lib/libopen-rte.so.40 
>>> (0x00007fa208734000)
>>>     libopen-pal.so.40 => /c7/shared/openmpi/3.0.0/lib/libopen-pal.so.40 
>>> (0x00007fa208431000)
>>>     libdl.so.2 => /lib64/libdl.so.2 (0x00007fa20822d000)
>>>     libnuma.so.1 => /lib64/libnuma.so.1 (0x00007fa208021000)
>>>     libudev.so.1 => /lib64/libudev.so.1 (0x00007fa20800b000)
>>>     librt.so.1 => /lib64/librt.so.1 (0x00007fa207e03000)
>>>     libm.so.6 => /lib64/libm.so.6 (0x00007fa207b01000)
>>>     libutil.so.1 => /lib64/libutil.so.1 (0x00007fa2078fd000)
>>>     libz.so.1 => /lib64/libz.so.1 (0x00007fa2076e7000)
>>>     /lib64/ld-linux-x86-64.so.2 (0x000055e717175000)
>>>     libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa2074d0000)
>>>     libcap.so.2 => /lib64/libcap.so.2 (0x00007fa2072cb000)
>>>     libdw.so.1 => /lib64/libdw.so.1 (0x00007fa207084000)
>>>     libattr.so.1 => /lib64/libattr.so.1 (0x00007fa206e7e000)
>>>     libelf.so.1 => /lib64/libelf.so.1 (0x00007fa206c66000)
>>>     liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fa206a40000)
>>>     libbz2.so.1 => /lib64/libbz2.so.1 (0x00007fa20682f000)
>>> I am process number 0
>>> I am process number 1
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
>>> with errorcode 3.
>>> 
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> [borma.bis.pasteur.fr:08561] 1 more process has sent help message help-mpi-api.txt / mpi-abort
>>> [borma.bis.pasteur.fr:08561] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>> 
>>> When I increased the number of MPI processes from 2 to 6 (the number of
>>> cores on the desktop), only the openmpi-1.10.7 build hung (I had to kill
>>> it with ctrl-c); there were no errors with the 2.1.2 and 3.0.0 versions.
>>> 
>>> [tru@borma openmpi-test-abort]$ for i in 1.10.7 2.1.2 3.0.0; do module purge && module add openmpi/$i; echo $i; mpirun -n 6 ./aa-$i; done
>>> 1.10.7
>>> I am process number 0
>>> I am process number 1
>>> I am process number 2
>>> I am process number 3
>>> I am process number 4
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
>>> with errorcode 3.
>>> 
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> I am process number 5
>>> 
>>> 
>>> 
>>> ^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly 
>>> terminate
>>> 
>>> 2.1.2
>>> I am process number 2
>>> I am process number 3
>>> I am process number 4
>>> I am process number 0
>>> I am process number 1
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
>>> with errorcode 3.
>>> 
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>> You may or may not see output from other processes, depending on
>>> exactly when Open MPI kills them.
>>> --------------------------------------------------------------------------
>>> I am process number 5
>>> [borma.bis.pasteur.fr:10542] 5 more processes have sent help message help-mpi-api.txt / mpi-abort
>>> [borma.bis.pasteur.fr:10542
> ...
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users