Hi,

On Tue, Nov 07, 2017 at 02:05:20PM -0700, Nikolas Antolin wrote:
> Hello,
> 
> In debugging a test of an application, I recently came across odd behavior
> for simultaneous MPI_Abort calls. Namely, while the MPI_Abort was
> acknowledged by the process output, the mpirun process failed to exit. I
> was able to duplicate this behavior on multiple machines with OpenMPI
> versions 1.10.2, 1.10.5, and 1.10.6 with the following simple program:
> 
> #include <mpi.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <stdbool.h>
> 
> int main(int argc, char **argv)
> {
>     int rank;
> 
>     MPI_Init(&argc,&argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>     printf("I am process number %d\n", rank);
>     MPI_Abort(MPI_COMM_WORLD, 3);
>     return 0;
> }
> 
> Is this a bug or a feature? Does this behavior exist in OpenMPI versions
> 2.0 and 3.0?
I compiled your test case on CentOS 7 with Open MPI 1.10.7, 2.1.2 and
3.0.0. With 2 processes the program runs fine and mpirun exits with all
three versions.

[tru@borma openmpi-test-abort]$ for i in 1.10.7 2.1.2 3.0.0; do
     module purge && module add openmpi/$i && mpicc aa.c -o aa-$i && ldd aa-$i;
     mpirun  -n 2 ./aa-$i ; done
        linux-vdso.so.1 =>  (0x00007ffe115bd000)
        libmpi.so.12 => /c7/shared/openmpi/1.10.7/lib/libmpi.so.12 (0x00007f40d7b4a000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f40d78f7000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f40d7534000)
        libopen-rte.so.12 => /c7/shared/openmpi/1.10.7/lib/libopen-rte.so.12 (0x00007f40d72b8000)
        libopen-pal.so.13 => /c7/shared/openmpi/1.10.7/lib/libopen-pal.so.13 (0x00007f40d6fd9000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f40d6dcd000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f40d6bc9000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f40d69c0000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f40d66be000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f40d64bb000)
        /lib64/ld-linux-x86-64.so.2 (0x000055f6d96c4000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f40d62a4000)
I am process number 1
I am process number 0
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[borma.bis.pasteur.fr:08511] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:08511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
        linux-vdso.so.1 =>  (0x00007fffaabcd000)
        libmpi.so.20 => /c7/shared/openmpi/2.1.2/lib/libmpi.so.20 (0x00007f5bcee39000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5bcebe6000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f5bce823000)
        libopen-rte.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-rte.so.20 (0x00007f5bce5a0000)
        libopen-pal.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-pal.so.20 (0x00007f5bce2a7000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f5bce0a3000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f5bcde97000)
        libudev.so.1 => /lib64/libudev.so.1 (0x00007f5bcde81000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f5bcdc79000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f5bcd977000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f5bcd773000)
        /lib64/ld-linux-x86-64.so.2 (0x000055718df01000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5bcd55d000)
        libcap.so.2 => /lib64/libcap.so.2 (0x00007f5bcd357000)
        libdw.so.1 => /lib64/libdw.so.1 (0x00007f5bcd110000)
        libattr.so.1 => /lib64/libattr.so.1 (0x00007f5bccf0b000)
        libelf.so.1 => /lib64/libelf.so.1 (0x00007f5bcccf2000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f5bccadc000)
        liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f5bcc8b6000)
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f5bcc6a5000)
I am process number 1
I am process number 0
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[borma.bis.pasteur.fr:08534] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:08534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
        linux-vdso.so.1 =>  (0x00007ffc09585000)
        libmpi.so.40 => /c7/shared/openmpi/3.0.0/lib/libmpi.so.40 (0x00007fa208ffc000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa208da9000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fa2089e6000)
        libopen-rte.so.40 => /c7/shared/openmpi/3.0.0/lib/libopen-rte.so.40 (0x00007fa208734000)
        libopen-pal.so.40 => /c7/shared/openmpi/3.0.0/lib/libopen-pal.so.40 (0x00007fa208431000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fa20822d000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007fa208021000)
        libudev.so.1 => /lib64/libudev.so.1 (0x00007fa20800b000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fa207e03000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fa207b01000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007fa2078fd000)
        libz.so.1 => /lib64/libz.so.1 (0x00007fa2076e7000)
        /lib64/ld-linux-x86-64.so.2 (0x000055e717175000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa2074d0000)
        libcap.so.2 => /lib64/libcap.so.2 (0x00007fa2072cb000)
        libdw.so.1 => /lib64/libdw.so.1 (0x00007fa207084000)
        libattr.so.1 => /lib64/libattr.so.1 (0x00007fa206e7e000)
        libelf.so.1 => /lib64/libelf.so.1 (0x00007fa206c66000)
        liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fa206a40000)
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007fa20682f000)
I am process number 0
I am process number 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[borma.bis.pasteur.fr:08561] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:08561] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

When I increased the number of MPI processes from 2 to 6 (the number of cores
of the desktop), only the version built with openmpi-1.10.7 hung (I had to
kill it with ctrl-c). There were no errors with the 2.1.2 and 3.0.0 versions.

[tru@borma openmpi-test-abort]$ for i in 1.10.7 2.1.2 3.0.0; do
     module purge && module add openmpi/$i; echo $i;  mpirun  -n 6 ./aa-$i ;
done
1.10.7
I am process number 0
I am process number 1
I am process number 2
I am process number 3
I am process number 4
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
I am process number 5



^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

2.1.2
I am process number 2
I am process number 3
I am process number 4
I am process number 0
I am process number 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
I am process number 5
[borma.bis.pasteur.fr:10542] 5 more processes have sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:10542] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
3.0.0
I am process number 2
I am process number 0
I am process number 3
I am process number 5
I am process number 4
I am process number 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[borma.bis.pasteur.fr:10570] 5 more processes have sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:10570] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

-> possibly a race condition in 1.10.7?
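
If it is a race, the hang will probably not show up on every run, so it may
help to repeat the run under a time limit and count how often mpirun has to
be killed. The following is only a rough sketch, not something I ran for the
results above: count_hangs is a made-up helper name, the 5 second grace
period is arbitrary, and it assumes the coreutils timeout command is
installed.

```shell
# Hypothetical helper (not part of Open MPI): run a command several times
# under a time limit and count the runs that had to be killed.
# Usage: count_hangs <runs> <limit_seconds> <command> [args...]
count_hangs() {
    runs=$1
    limit=$2
    shift 2
    hangs=0
    n=0
    while [ "$n" -lt "$runs" ]; do
        # timeout sends TERM after $limit seconds, then KILL 5 seconds
        # later if the command is still alive.
        timeout -k 5 "$limit" "$@" > /dev/null 2>&1
        status=$?
        # timeout exits with 124 when the limit expired, and with 137
        # (128+9) when it had to fall back to KILL.
        case $status in
            124|137) hangs=$((hangs + 1)) ;;
        esac
        n=$((n + 1))
    done
    echo "$hangs/$runs runs hung"
}
```

For example, "count_hangs 20 30 mpirun -n 6 ./aa-1.10.7" would run the test
20 times with a 30 second limit each. Both numbers are arbitrary, and a
command that itself exits with status 124 or 137 would be miscounted as a
hang.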
Cheers

Tru


-- 
Dr Tru Huynh | mailto:t...@pasteur.fr | tel/fax +33 1 45 68 87 37/19
https://research.pasteur.fr/en/team/structural-bioinformatics/
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France  
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users