Hi,

On Tue, Nov 07, 2017 at 02:05:20PM -0700, Nikolas Antolin wrote:
> Hello,
>
> In debugging a test of an application, I recently came across odd behavior
> for simultaneous MPI_Abort calls. Namely, while the MPI_Abort was
> acknowledged by the process output, the mpirun process failed to exit. I
> was able to duplicate this behavior on multiple machines with OpenMPI
> versions 1.10.2, 1.10.5, and 1.10.6 with the following simple program:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <stdbool.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>
>     MPI_Init(&argc,&argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     printf("I am process number %d\n", rank);
>     MPI_Abort(MPI_COMM_WORLD, 3);
>     return 0;
> }
>
> Is this a bug or a feature? Does this behavior exist in OpenMPI versions
> 2.0 and 3.0?

I compiled your test case on CentOS-7 with openmpi 1.10.7, 2.1.2 and 3.0.0.
With 2 processes the program runs fine with all three versions:

[tru@borma openmpi-test-abort]$ for i in 1.10.7 2.1.2 3.0.0; do module purge && module add openmpi/$i && mpicc aa.c -o aa-$i && ldd aa-$i; mpirun -n 2 ./aa-$i ; done
        linux-vdso.so.1 => (0x00007ffe115bd000)
        libmpi.so.12 => /c7/shared/openmpi/1.10.7/lib/libmpi.so.12 (0x00007f40d7b4a000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f40d78f7000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f40d7534000)
        libopen-rte.so.12 => /c7/shared/openmpi/1.10.7/lib/libopen-rte.so.12 (0x00007f40d72b8000)
        libopen-pal.so.13 => /c7/shared/openmpi/1.10.7/lib/libopen-pal.so.13 (0x00007f40d6fd9000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f40d6dcd000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f40d6bc9000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f40d69c0000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f40d66be000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f40d64bb000)
        /lib64/ld-linux-x86-64.so.2 (0x000055f6d96c4000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f40d62a4000)
I am process number 1
I am process number 0
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[borma.bis.pasteur.fr:08511] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:08511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
        linux-vdso.so.1 => (0x00007fffaabcd000)
        libmpi.so.20 => /c7/shared/openmpi/2.1.2/lib/libmpi.so.20 (0x00007f5bcee39000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5bcebe6000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f5bce823000)
        libopen-rte.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-rte.so.20 (0x00007f5bce5a0000)
        libopen-pal.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-pal.so.20 (0x00007f5bce2a7000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f5bce0a3000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f5bcde97000)
        libudev.so.1 => /lib64/libudev.so.1 (0x00007f5bcde81000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f5bcdc79000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f5bcd977000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f5bcd773000)
        /lib64/ld-linux-x86-64.so.2 (0x000055718df01000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5bcd55d000)
        libcap.so.2 => /lib64/libcap.so.2 (0x00007f5bcd357000)
        libdw.so.1 => /lib64/libdw.so.1 (0x00007f5bcd110000)
        libattr.so.1 => /lib64/libattr.so.1 (0x00007f5bccf0b000)
        libelf.so.1 => /lib64/libelf.so.1 (0x00007f5bcccf2000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f5bccadc000)
        liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f5bcc8b6000)
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f5bcc6a5000)
I am process number 1
I am process number 0
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[borma.bis.pasteur.fr:08534] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:08534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
        linux-vdso.so.1 => (0x00007ffc09585000)
        libmpi.so.40 => /c7/shared/openmpi/3.0.0/lib/libmpi.so.40 (0x00007fa208ffc000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa208da9000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fa2089e6000)
        libopen-rte.so.40 => /c7/shared/openmpi/3.0.0/lib/libopen-rte.so.40 (0x00007fa208734000)
        libopen-pal.so.40 => /c7/shared/openmpi/3.0.0/lib/libopen-pal.so.40 (0x00007fa208431000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fa20822d000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00007fa208021000)
        libudev.so.1 => /lib64/libudev.so.1 (0x00007fa20800b000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fa207e03000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fa207b01000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007fa2078fd000)
        libz.so.1 => /lib64/libz.so.1 (0x00007fa2076e7000)
        /lib64/ld-linux-x86-64.so.2 (0x000055e717175000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa2074d0000)
        libcap.so.2 => /lib64/libcap.so.2 (0x00007fa2072cb000)
        libdw.so.1 => /lib64/libdw.so.1 (0x00007fa207084000)
        libattr.so.1 => /lib64/libattr.so.1 (0x00007fa206e7e000)
        libelf.so.1 => /lib64/libelf.so.1 (0x00007fa206c66000)
        liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fa206a40000)
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007fa20682f000)
I am process number 0
I am process number 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[borma.bis.pasteur.fr:08561] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:08561] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

When I increased the number of MPI processes from 2 to 6 (the number of
cores of the desktop), only the binary built with openmpi 1.10.7 hung (I had
to kill it with ctrl-c). The 2.1.2 and 3.0.0 builds exited without errors.

[tru@borma openmpi-test-abort]$ for i in 1.10.7 2.1.2 3.0.0; do module purge && module add openmpi/$i; echo $i; mpirun -n 6 ./aa-$i ; done
1.10.7
I am process number 0
I am process number 1
I am process number 2
I am process number 3
I am process number 4
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
I am process number 5
^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
2.1.2
I am process number 2
I am process number 3
I am process number 4
I am process number 0
I am process number 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
I am process number 5
[borma.bis.pasteur.fr:10542] 5 more processes have sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:10542] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
3.0.0
I am process number 2
I am process number 0
I am process number 3
I am process number 5
I am process number 4
I am process number 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[borma.bis.pasteur.fr:10570] 5 more processes have sent help message help-mpi-api.txt / mpi-abort
[borma.bis.pasteur.fr:10570] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

So this looks like a race condition in 1.10.7?

Cheers

Tru
-- 
Dr Tru Huynh | mailto:t...@pasteur.fr | tel/fax +33 1 45 68 87 37/19
https://research.pasteur.fr/en/team/structural-bioinformatics/
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France
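
P.S. Since the 1.10.7 run only stopped after a manual ctrl-c, repeated runs
are easier to compare if the hang is detected automatically. Below is a
minimal sketch, not something that was run for the results above. It assumes
bash and GNU coreutils `timeout`; `check_hang` is a hypothetical helper name
and the 30 second limit in the usage comment is an arbitrary choice. It also
assumes a hung mpirun may ignore the first SIGTERM, hence the follow-up
SIGKILL via `-k`.

```shell
#!/bin/bash
# Sketch: run a command under a time limit and report whether it hung.
# check_hang is a hypothetical helper, not part of Open MPI.
check_hang() {
    local limit=$1 rc
    shift
    # Send SIGTERM after $limit seconds, then SIGKILL 5 seconds later
    # in case the command does not react to SIGTERM.
    timeout -k 5 "$limit" "$@"
    rc=$?
    # GNU timeout exits with 124 on a timeout, or 137 (128+9) if it
    # had to fall back to SIGKILL.
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        echo "HANG (killed after ${limit}s): $*"
    else
        echo "exited with status $rc: $*"
    fi
    return "$rc"
}

# Possible usage with the binaries and module names from the transcript:
# for i in 1.10.7 2.1.2 3.0.0; do
#     module purge && module add openmpi/$i
#     check_hang 30 mpirun -n 6 ./aa-$i
# done
```

Running the loop several times would also show whether the hang is
intermittent, which is what one would expect from a race.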