[OMPI users] OpenMPI 1.10.x handling of simultaneous MPI_Abort calls
Hello,

In debugging a test of an application, I recently came across odd behavior
for simultaneous MPI_Abort calls. Namely, while the MPI_Abort was
acknowledged by the process output, the mpirun process failed to exit. I
was able to duplicate this behavior on multiple machines with OpenMPI
versions 1.10.2, 1.10.5, and 1.10.6 with the following simple program
(the archive stripped the header names from the #include lines; the two
the program actually needs are restored here):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("I am process number %d\n", rank);
    MPI_Abort(MPI_COMM_WORLD, 3);
    return 0;
}

Is this a bug or a feature? Does this behavior exist in OpenMPI versions
2.0 and 3.0?

Best,
Nik

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] OpenMPI 1.10.x handling of simultaneous MPI_Abort calls
Thank you for the replies. Do I understand correctly that since OpenMPI
v1.10 is no longer supported, I am unlikely to see a bug fix for this
without moving to v2.x or v3.x? I am dealing with clusters whose
administrators may be loath to update packages until it is absolutely
necessary, and I want to present them with a complete outlook on the
problem.

Thanks,
Nik

2017-11-07 19:00 GMT-07:00 r...@open-mpi.org:
> Glad to hear it has already been fixed :-)
>
> Thanks!
>
> On Nov 7, 2017, at 4:13 PM, Tru Huynh wrote:
>
> Hi,
>
> On Tue, Nov 07, 2017 at 02:05:20PM -0700, Nikolas Antolin wrote:
>
> [original message and test program quoted]
>
> I compiled your test case on CentOS-7 with openmpi 1.10.7/2.1.2 and
> 3.0.0 and the program seems to run fine.
> [tru@borma openmpi-test-abort]$ for i in 1.10.7 2.1.2 3.0.0; do \
>     module purge && module add openmpi/$i && mpicc aa.c -o aa-$i && \
>     ldd aa-$i; mpirun -n 2 ./aa-$i ; done
>
>         linux-vdso.so.1 => (0x7ffe115bd000)
>         libmpi.so.12 => /c7/shared/openmpi/1.10.7/lib/libmpi.so.12 (0x7f40d7b4a000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x7f40d78f7000)
>         libc.so.6 => /lib64/libc.so.6 (0x7f40d7534000)
>         libopen-rte.so.12 => /c7/shared/openmpi/1.10.7/lib/libopen-rte.so.12 (0x7f40d72b8000)
>         libopen-pal.so.13 => /c7/shared/openmpi/1.10.7/lib/libopen-pal.so.13 (0x7f40d6fd9000)
>         libnuma.so.1 => /lib64/libnuma.so.1 (0x7f40d6dcd000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x7f40d6bc9000)
>         librt.so.1 => /lib64/librt.so.1 (0x7f40d69c)
>         libm.so.6 => /lib64/libm.so.6 (0x7f40d66be000)
>         libutil.so.1 => /lib64/libutil.so.1 (0x7f40d64bb000)
>         /lib64/ld-linux-x86-64.so.2 (0x55f6d96c4000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f40d62a4000)
> I am process number 1
> I am process number 0
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> with errorcode 3.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [borma.bis.pasteur.fr:08511] 1 more process has sent help message help-mpi-api.txt / mpi-abort
> [borma.bis.pasteur.fr:08511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
>         linux-vdso.so.1 => (0x7fffaabcd000)
>         libmpi.so.20 => /c7/shared/openmpi/2.1.2/lib/libmpi.so.20 (0x7f5bcee39000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x7f5bcebe6000)
>         libc.so.6 => /lib64/libc.so.6 (0x7f5bce823000)
>         libopen-rte.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-rte.so.20 (0x7f5bce5a)
>         libopen-pal.so.20 => /c7/shared/openmpi/2.1.2/lib/libopen-pal.so.20 (0x7f5bce2a7000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x7f5bce0a3000)
>         libnuma.so.1 => /lib64/libnuma.so.1 (0x7f5bcde97000)
>         libudev.so.1 => /lib64/libudev.so.1 (0x7f5bcde81000)
>         librt.so.1 => /lib64/librt.so.1 (0x7f5bcdc79000)
>         libm.so.6 => /lib64/libm.so.6 (0x7f5bcd977000)
>         libutil.so.1 => /lib64/libutil.so.1 (0x7f5bcd773000)
>         /lib64/ld-linux-x86-64.so.2 (0x55718df01000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f5bcd55d000)
>         libcap.so.2 => /lib64/libcap.so.2 (0x7f5bcd357000)
>         libdw.so.1 => /lib64/libdw.so.1 (0x7f5bcd11)
>         libattr.so.1 => /lib64/libattr.so.1 (0x7f5bccf0b000)
>         libelf.so.1 => /lib64/libelf.so.1 (0x7f5bcccf2000)
>         libz.so.1 => /lib64/libz.so.1 (0x7f5bccadc000)
>         liblzma.so.5 => /lib64/liblzma
Re: [OMPI users] OpenMPI 1.10.x handling of simultaneous MPI_Abort calls
That was not my interpretation. His message said he did not observe the
race condition for 2 processes, but did for 6 processes. I observe a
failure of mpirun to exit around 25-30% of the time with 2 processes,
causing an inconsistent hang in both my example program and my larger
application.

-Nik

On Nov 8, 2017 11:40, "r...@open-mpi.org" wrote:
> According to the other reporter, it has been fixed in 1.10.7. I haven't
> verified that, but I'd suggest trying it first.
>
> On Nov 8, 2017, at 8:26 AM, Nikolas Antolin wrote:
>
> [earlier messages in the thread quoted]
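Since the hang is intermittent (roughly 25-30% of runs in Nik's report), one way to characterize it is to bound each run with coreutils `timeout`, which exits with status 124 when it has to kill the command. A minimal sketch, assuming the reproducer was compiled to `./aa` (as in Tru's transcript); the binary name and run counts are placeholders to adjust:

```shell
#!/bin/sh
# Repeatedly run the reproducer under a time limit and count the runs
# where mpirun fails to exit. Assumes ./aa was built from the test
# program in this thread (mpicc aa.c -o aa).
hangs=0
runs=10
for i in $(seq 1 "$runs"); do
    timeout 10 mpirun -n 2 ./aa > /dev/null 2>&1
    # coreutils timeout exits with status 124 when it killed the command
    if [ $? -eq 124 ]; then
        hangs=$((hangs + 1))
    fi
done
echo "mpirun hung in $hangs of $runs runs"
```

On an affected 1.10.x install one would expect a nonzero hang count; on 1.10.7 or later the count should be zero if the fix the other reporter mentioned holds.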