Hello again,

I am still not sure about breakpoints, but I did a "catch signal" in gdb, with one gdb session attached to each of the two vasp processes and one attached to mpirun.
When the root rank exits I see in the gdb attached to it

[Thread 0x2b2787df8700 (LWP 2457) exited]
[Thread 0x2b277f483180 (LWP 2455) exited]
[Inferior 1 (process 2455) exited normally]

In the gdb attached to mpirun

Catchpoint 1 (signal SIGCHLD), 0x00002b16560f769d in poll () from /lib64/libc.so.6

In the gdb attached to the second rank I see no output.

Issuing "continue" in the gdb session attached to mpirun does not lead
to anything new as far as I can tell.

The stack trace of mpirun after that (Ctrl-C'ed to stop it again) is

#0  0x00002b16560f769d in poll () from /lib64/libc.so.6
#1  0x00002b1654b3a496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2  0x00002b1654b32fa5 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3  0x0000000000406311 in orterun (argc=7, argv=0x7ffdabfbebc8) at orterun.c:1071
#4  0x00000000004037e0 in main (argc=7, argv=0x7ffdabfbebc8) at main.c:13

So there is a signal and mpirun does nothing with it ?

Cheers

Christof

On Thu, Dec 08, 2016 at 12:39:06PM +0100, Christof Koehler wrote:
> Hello,
>
> On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
> > Christof,
> >
> >
> > There is something really odd with this stack trace.
> > count is zero, and some pointers do not point to valid addresses (!)
> Yes, I assumed it was interesting :-) Note that the program is compiled
> with -O2 -fp-model source, so optimization is on. I can try with -O0
> or the gcc/gfortran ( will take a moment) to make sure it is not a
> problem from that.
>
> >
> > in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> > the stack has been corrupted inside MPI_Allreduce(), or that you are not
> > using the library you think you use
> > pmap <pid> will show you which lib is used
> The pmap of the survivor is at the very end of this mail.
>
> >
> > btw, this was not started with
> > mpirun --mca coll ^tuned ...
> > right ?
> This is correct, not started with "mpirun --mca coll ^tuned". Using it
> does not change anything.
>
> >
> > just to make it clear ...
> > a task from your program bluntly issues a fortran STOP, and this is kind of
> > a feature.
> Yes. The library where the stack occurs is/was written for serial use as
> far as I can tell. As I mentioned, it is not our code but this one
> http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/ which
> should be a working combination.
>
> > the *only* issue is mpirun does not kill the other MPI tasks and mpirun
> > never completes.
> > did i get it right ?
> Yes ! So it is not a really big problem IMO. Just a bit nasty if this
> would happen with a job in the queueing system.
>
> Best Regards
>
> Christof
>
> Note: git branch 2.0.2 of openmpi was configured and installed (make
> install) with
> ./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
> precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
> FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
> --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
> --prefix=/cluster/mpi/openmpi/2.0.2/intel2016
>
> The OS is Centos 7, relatively current :-) with current Omni-Path driver
> package from Intel (10.2).
>
> vasp is linked against Intel MKL Lapack/Blas, self compiled scalapack
> (trunk 206) and FFTW 3.3.5. FFTW and scalapack statically linked. And of
> course the libwannier.a version 1.2 statically linked.
> > pmap -p of the survivor > > 32282: /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca > 0000000000400000 65200K r-x-- > /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca > 00000000045ab000 100K r---- > /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca > 00000000045c4000 2244K rw--- > /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca > 00000000047f5000 100900K rw--- [ anon ] > 000000000bfaa000 684K rw--- [ anon ] > 000000000c055000 20K rw--- [ anon ] > 000000000c05a000 424K rw--- [ anon ] > 000000000c0c4000 68K rw--- [ anon ] > 000000000c0d5000 25384K rw--- [ anon ] > 00002b17e34f6000 132K r-x-- /usr/lib64/ld-2.17.so > 00002b17e3517000 4K rw--- [ anon ] > 00002b17e3518000 28K rw-s- /dev/infiniband/uverbs0 > 00002b17e3523000 88K rw--- [ anon ] > 00002b17e3539000 772K rw-s- /dev/infiniband/uverbs0 > 00002b17e35fa000 772K rw-s- /dev/infiniband/uverbs0 > 00002b17e36bb000 196K rw-s- /dev/infiniband/uverbs0 > 00002b17e36ec000 28K rw-s- /dev/infiniband/uverbs0 > 00002b17e36f3000 20K rw-s- /dev/infiniband/uverbs0 > 00002b17e3717000 4K r---- /usr/lib64/ld-2.17.so > 00002b17e3718000 4K rw--- /usr/lib64/ld-2.17.so > 00002b17e3719000 4K rw--- [ anon ] > 00002b17e371a000 88K r-x-- /usr/lib64/libpthread-2.17.so > 00002b17e3730000 2048K ----- /usr/lib64/libpthread-2.17.so > 00002b17e3930000 4K r---- /usr/lib64/libpthread-2.17.so > 00002b17e3931000 4K rw--- /usr/lib64/libpthread-2.17.so > 00002b17e3932000 16K rw--- [ anon ] > 00002b17e3936000 1028K r-x-- /usr/lib64/libm-2.17.so > 00002b17e3a37000 2044K ----- /usr/lib64/libm-2.17.so > 00002b17e3c36000 4K r---- /usr/lib64/libm-2.17.so > 00002b17e3c37000 4K rw--- /usr/lib64/libm-2.17.so > 00002b17e3c38000 12K r-x-- /usr/lib64/libdl-2.17.so > 00002b17e3c3b000 2044K ----- /usr/lib64/libdl-2.17.so > 00002b17e3e3a000 4K r---- /usr/lib64/libdl-2.17.so > 00002b17e3e3b000 4K rw--- /usr/lib64/libdl-2.17.so > 00002b17e3e3c000 184K r-x-- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0 
> 00002b17e3e6a000 2044K ----- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0 > 00002b17e4069000 4K r---- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0 > 00002b17e406a000 4K rw--- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0 > 00002b17e406b000 36K r-x-- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0 > 00002b17e4074000 2044K ----- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0 > 00002b17e4273000 4K r---- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0 > 00002b17e4274000 4K rw--- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0 > 00002b17e4275000 396K r-x-- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0 > 00002b17e42d8000 2044K ----- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0 > 00002b17e44d7000 4K r---- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0 > 00002b17e44d8000 4K rw--- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0 > 00002b17e44d9000 1948K r-x-- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1 > 00002b17e46c0000 2044K ----- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1 > 00002b17e48bf000 12K r---- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1 > 00002b17e48c2000 104K rw--- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1 > 00002b17e48dc000 76K rw--- [ anon ] > 00002b17e48ef000 948K r-x-- /usr/lib64/libc-2.17.so > 00002b17e49dc000 4K r-x-- /usr/lib64/libc-2.17.so > 00002b17e49dd000 12K r-x-- /usr/lib64/libc-2.17.so > 00002b17e49e0000 4K r-x-- /usr/lib64/libc-2.17.so > 00002b17e49e1000 20K r-x-- /usr/lib64/libc-2.17.so > 00002b17e49e6000 8K r-x-- /usr/lib64/libc-2.17.so > 00002b17e49e8000 760K r-x-- /usr/lib64/libc-2.17.so > 00002b17e4aa6000 2048K ----- /usr/lib64/libc-2.17.so > 00002b17e4ca6000 16K r---- /usr/lib64/libc-2.17.so > 
00002b17e4caa000 8K rw--- /usr/lib64/libc-2.17.so > 00002b17e4cac000 20K rw--- [ anon ] > 00002b17e4cb1000 84K r-x-- /usr/lib64/libgcc_s-4.8.5-20150702.so.1 > 00002b17e4cc6000 2044K ----- /usr/lib64/libgcc_s-4.8.5-20150702.so.1 > 00002b17e4ec5000 4K r---- /usr/lib64/libgcc_s-4.8.5-20150702.so.1 > 00002b17e4ec6000 4K rw--- /usr/lib64/libgcc_s-4.8.5-20150702.so.1 > 00002b17e4ec7000 452K r-x-- /usr/lib64/libpsm2.so.2.1 > 00002b17e4f38000 2044K ----- /usr/lib64/libpsm2.so.2.1 > 00002b17e5137000 4K r---- /usr/lib64/libpsm2.so.2.1 > 00002b17e5138000 8K rw--- /usr/lib64/libpsm2.so.2.1 > 00002b17e513a000 4K rw--- [ anon ] > 00002b17e513b000 1344K r-x-- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0 > 00002b17e528b000 2044K ----- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0 > 00002b17e548a000 8K r---- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0 > 00002b17e548c000 44K rw--- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0 > 00002b17e5497000 12K rw--- [ anon ] > 00002b17e549a000 480K r-x-- /usr/lib64/libtorque.so.2.0.0 > 00002b17e5512000 2044K ----- /usr/lib64/libtorque.so.2.0.0 > 00002b17e5711000 8K r---- /usr/lib64/libtorque.so.2.0.0 > 00002b17e5713000 8K rw--- /usr/lib64/libtorque.so.2.0.0 > 00002b17e5715000 6704K rw--- [ anon ] > 00002b17e5da1000 1404K r-x-- /usr/lib64/libxml2.so.2.9.1 > 00002b17e5f00000 2044K ----- /usr/lib64/libxml2.so.2.9.1 > 00002b17e60ff000 32K r---- /usr/lib64/libxml2.so.2.9.1 > 00002b17e6107000 8K rw--- /usr/lib64/libxml2.so.2.9.1 > 00002b17e6109000 8K rw--- [ anon ] > 00002b17e610b000 84K r-x-- /usr/lib64/libz.so.1.2.7 > 00002b17e6120000 2044K ----- /usr/lib64/libz.so.1.2.7 > 00002b17e631f000 4K r---- /usr/lib64/libz.so.1.2.7 > 00002b17e6320000 4K rw--- /usr/lib64/libz.so.1.2.7 > 00002b17e6321000 1784K r-x-- /usr/lib64/libcrypto.so.1.0.1e > 00002b17e64df000 2048K ----- /usr/lib64/libcrypto.so.1.0.1e > 00002b17e66df000 104K r---- /usr/lib64/libcrypto.so.1.0.1e > 
00002b17e66f9000 48K rw--- /usr/lib64/libcrypto.so.1.0.1e > 00002b17e6705000 16K rw--- [ anon ] > 00002b17e6709000 396K r-x-- /usr/lib64/libssl.so.1.0.1e > 00002b17e676c000 2044K ----- /usr/lib64/libssl.so.1.0.1e > 00002b17e696b000 16K r---- /usr/lib64/libssl.so.1.0.1e > 00002b17e696f000 28K rw--- /usr/lib64/libssl.so.1.0.1e > 00002b17e6976000 1572K r-x-- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0 > 00002b17e6aff000 2044K ----- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0 > 00002b17e6cfe000 20K r---- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0 > 00002b17e6d03000 56K rw--- > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0 > 00002b17e6d11000 552K rw--- [ anon ] > 00002b17e6d9b000 84K r-x-- /usr/lib64/librdmacm.so.1.0.0 > 00002b17e6db0000 2044K ----- /usr/lib64/librdmacm.so.1.0.0 > 00002b17e6faf000 4K r---- /usr/lib64/librdmacm.so.1.0.0 > 00002b17e6fb0000 4K rw--- /usr/lib64/librdmacm.so.1.0.0 > 00002b17e6fb1000 4K rw--- [ anon ] > 00002b17e6fb2000 68K r-x-- /usr/lib64/libibverbs.so.1.0.0 > 00002b17e6fc3000 2044K ----- /usr/lib64/libibverbs.so.1.0.0 > 00002b17e71c2000 4K r---- /usr/lib64/libibverbs.so.1.0.0 > 00002b17e71c3000 4K rw--- /usr/lib64/libibverbs.so.1.0.0 > 00002b17e71c4000 40K r-x-- /usr/lib64/libnuma.so.1 > 00002b17e71ce000 2048K ----- /usr/lib64/libnuma.so.1 > 00002b17e73ce000 4K r---- /usr/lib64/libnuma.so.1 > 00002b17e73cf000 4K rw--- /usr/lib64/libnuma.so.1 > 00002b17e73d0000 32K r-x-- /usr/lib64/libpciaccess.so.0.11.1 > 00002b17e73d8000 2048K ----- /usr/lib64/libpciaccess.so.0.11.1 > 00002b17e75d8000 4K r---- /usr/lib64/libpciaccess.so.0.11.1 > 00002b17e75d9000 4K rw--- /usr/lib64/libpciaccess.so.0.11.1 > 00002b17e75da000 28K r-x-- /usr/lib64/librt-2.17.so > 00002b17e75e1000 2044K ----- /usr/lib64/librt-2.17.so > 00002b17e77e0000 4K r---- /usr/lib64/librt-2.17.so > 00002b17e77e1000 4K rw--- /usr/lib64/librt-2.17.so > 00002b17e77e2000 8K r-x-- /usr/lib64/libutil-2.17.so > 
00002b17e77e4000 2044K ----- /usr/lib64/libutil-2.17.so > 00002b17e79e3000 4K r---- /usr/lib64/libutil-2.17.so > 00002b17e79e4000 4K rw--- /usr/lib64/libutil-2.17.so > 00002b17e79e5000 152K r-x-- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5 > 00002b17e7a0b000 2044K ----- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5 > 00002b17e7c0a000 4K r---- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5 > 00002b17e7c0b000 8K rw--- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5 > 00002b17e7c0d000 24K rw--- [ anon ] > 00002b17e7c13000 1288K r-x-- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5 > 00002b17e7d55000 2044K ----- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5 > 00002b17e7f54000 12K r---- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5 > 00002b17e7f57000 12K rw--- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5 > 00002b17e7f5a000 116K rw--- [ anon ] > 00002b17e7f77000 2696K r-x-- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so > 00002b17e8219000 2044K ----- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so > 00002b17e8418000 24K r---- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so > 00002b17e841e000 340K rw--- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so > 00002b17e8473000 420K r-x-- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5 > 00002b17e84dc000 2048K ----- > 
/cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5 > 00002b17e86dc000 4K r---- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5 > 00002b17e86dd000 4K rw--- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5 > 00002b17e86de000 4K rw--- [ anon ] > 00002b17e86df000 13124K r-x-- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so > 00002b17e93b0000 2048K ----- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so > 00002b17e95b0000 220K r---- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so > 00002b17e95e7000 20K rw--- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so > 00002b17e95ec000 1304K r-x-- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5 > 00002b17e9732000 2048K ----- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5 > 00002b17e9932000 12K r---- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5 > 00002b17e9935000 12K rw--- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5 > 00002b17e9938000 296K rw--- [ anon ] > 00002b17e9982000 1464K r-x-- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so > 00002b17e9af0000 2044K ----- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so > 00002b17e9cef000 4K r---- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so > 00002b17e9cf0000 16K rw--- > /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so > 00002b17e9cf4000 16K r-x-- /usr/lib64/libuuid.so.1.3.0 > 
00002b17e9cf8000 2044K ----- /usr/lib64/libuuid.so.1.3.0 > 00002b17e9ef7000 4K r---- /usr/lib64/libuuid.so.1.3.0 > 00002b17e9ef8000 4K rw--- /usr/lib64/libuuid.so.1.3.0 > 00002b17e9ef9000 932K r-x-- /usr/lib64/libstdc++.so.6.0.19 > 00002b17e9fe2000 2048K ----- /usr/lib64/libstdc++.so.6.0.19 > 00002b17ea1e2000 32K r---- /usr/lib64/libstdc++.so.6.0.19 > 00002b17ea1ea000 8K rw--- /usr/lib64/libstdc++.so.6.0.19 > 00002b17ea1ec000 84K rw--- [ anon ] > 00002b17ea201000 144K r-x-- /usr/lib64/liblzma.so.5.0.99 > 00002b17ea225000 2044K ----- /usr/lib64/liblzma.so.5.0.99 > 00002b17ea424000 4K r---- /usr/lib64/liblzma.so.5.0.99 > 00002b17ea425000 4K rw--- /usr/lib64/liblzma.so.5.0.99 > 00002b17ea426000 292K r-x-- /usr/lib64/libgssapi_krb5.so.2.2 > 00002b17ea46f000 2048K ----- /usr/lib64/libgssapi_krb5.so.2.2 > 00002b17ea66f000 4K r---- /usr/lib64/libgssapi_krb5.so.2.2 > 00002b17ea670000 8K rw--- /usr/lib64/libgssapi_krb5.so.2.2 > 00002b17ea672000 852K r-x-- /usr/lib64/libkrb5.so.3.3 > 00002b17ea747000 2048K ----- /usr/lib64/libkrb5.so.3.3 > 00002b17ea947000 52K r---- /usr/lib64/libkrb5.so.3.3 > 00002b17ea954000 12K rw--- /usr/lib64/libkrb5.so.3.3 > 00002b17ea957000 12K r-x-- /usr/lib64/libcom_err.so.2.1 > 00002b17ea95a000 2044K ----- /usr/lib64/libcom_err.so.2.1 > 00002b17eab59000 4K r---- /usr/lib64/libcom_err.so.2.1 > 00002b17eab5a000 4K rw--- /usr/lib64/libcom_err.so.2.1 > 00002b17eab5b000 188K r-x-- /usr/lib64/libk5crypto.so.3.1 > 00002b17eab8a000 2044K ----- /usr/lib64/libk5crypto.so.3.1 > 00002b17ead89000 8K r---- /usr/lib64/libk5crypto.so.3.1 > 00002b17ead8b000 4K rw--- /usr/lib64/libk5crypto.so.3.1 > 00002b17ead8c000 4K rw--- [ anon ] > 00002b17ead8d000 284K r-x-- /usr/lib64/libnl-route-3.so.200.16.1 > 00002b17eadd4000 2044K ----- /usr/lib64/libnl-route-3.so.200.16.1 > 00002b17eafd3000 12K r---- /usr/lib64/libnl-route-3.so.200.16.1 > 00002b17eafd6000 16K rw--- /usr/lib64/libnl-route-3.so.200.16.1 > 00002b17eafda000 8K rw--- [ anon ] > 00002b17eafdc000 104K r-x-- 
/usr/lib64/libnl-3.so.200.16.1 > 00002b17eaff6000 2044K ----- /usr/lib64/libnl-3.so.200.16.1 > 00002b17eb1f5000 8K r---- /usr/lib64/libnl-3.so.200.16.1 > 00002b17eb1f7000 4K rw--- /usr/lib64/libnl-3.so.200.16.1 > 00002b17eb1f8000 52K r-x-- /usr/lib64/libkrb5support.so.0.1 > 00002b17eb205000 2048K ----- /usr/lib64/libkrb5support.so.0.1 > 00002b17eb405000 4K r---- /usr/lib64/libkrb5support.so.0.1 > 00002b17eb406000 4K rw--- /usr/lib64/libkrb5support.so.0.1 > 00002b17eb407000 12K r-x-- /usr/lib64/libkeyutils.so.1.5 > 00002b17eb40a000 2044K ----- /usr/lib64/libkeyutils.so.1.5 > 00002b17eb609000 4K r---- /usr/lib64/libkeyutils.so.1.5 > 00002b17eb60a000 4K rw--- /usr/lib64/libkeyutils.so.1.5 > 00002b17eb60b000 88K r-x-- /usr/lib64/libresolv-2.17.so > 00002b17eb621000 2048K ----- /usr/lib64/libresolv-2.17.so > 00002b17eb821000 4K r---- /usr/lib64/libresolv-2.17.so > 00002b17eb822000 4K rw--- /usr/lib64/libresolv-2.17.so > 00002b17eb823000 8K rw--- [ anon ] > 00002b17eb825000 132K r-x-- /usr/lib64/libselinux.so.1 > 00002b17eb846000 2048K ----- /usr/lib64/libselinux.so.1 > 00002b17eba46000 4K r---- /usr/lib64/libselinux.so.1 > 00002b17eba47000 4K rw--- /usr/lib64/libselinux.so.1 > 00002b17eba48000 8K rw--- [ anon ] > 00002b17eba4a000 384K r-x-- /usr/lib64/libpcre.so.1.2.0 > 00002b17ebaaa000 2044K ----- /usr/lib64/libpcre.so.1.2.0 > 00002b17ebca9000 4K r---- /usr/lib64/libpcre.so.1.2.0 > 00002b17ebcaa000 4K rw--- /usr/lib64/libpcre.so.1.2.0 > 00002b17ebcab000 4K ----- [ anon ] > 00002b17ebcac000 3352K rw--- [ anon ] > 00002b17ec000000 132K rw--- [ anon ] > 00002b17ec021000 65404K ----- [ anon ] > 00002b17f0000000 4K ----- [ anon ] > 00002b17f0001000 2048K rw--- [ anon ] > 00002b17f0201000 16K r-x-- /usr/lib64/libhfi1verbs-rdmav2.so > 00002b17f0205000 2044K ----- /usr/lib64/libhfi1verbs-rdmav2.so > 00002b17f0404000 4K r---- /usr/lib64/libhfi1verbs-rdmav2.so > 00002b17f0405000 4K rw--- /usr/lib64/libhfi1verbs-rdmav2.so > 00002b17f0406000 4K rw--- [ anon ] > 00002b17f0407000 
4096K rw--- [ anon ] > 00002b17f0807000 1032K rw--- [ anon ] > 00002b17f0909000 4100K rw-s- > /tmp/openmpi-sessions-12001@node109_0/52426/1/1/vader_segment.node109.1 > 00002b17f0d0a000 4236K rw-s- /dev/shm/psm2_shm.1200100000001a17100200 > 00002b17f112d000 132K rw--- [ anon ] > 00002b17f114e000 4236K rw-s- /dev/shm/psm2_shm.1200100000000a17100000 > (deleted) > 00002b17f1571000 8628K rw--- [ anon ] > 00002b17f4000000 132K rw--- [ anon ] > 00002b17f4021000 65404K ----- [ anon ] > 00002b17f9e85000 9164K rw--- [ anon ] > 00007ffd8b021000 31316K rw--- [ stack ] > 00007ffd8cfa4000 8K r-x-- [ anon ] > ffffffffff600000 4K r-x-- [ anon ] > total 539352K > > > > > > > Cheers, > > > > Gilles > > > > On Thursday, December 8, 2016, Christof Koehler < > > christof.koeh...@bccms.uni-bremen.de> wrote: > > > > > Hello everybody, > > > > > > I tried it with the nightly and the direct 2.0.2 branch from git which > > > according to the log should contain that patch > > > > > > commit d0b97d7a408b87425ca53523de369da405358ba2 > > > Merge: ac8c019 b9420bb > > > Author: Jeff Squyres <jsquy...@users.noreply.github.com <javascript:;>> > > > Date: Wed Dec 7 18:24:46 2016 -0500 > > > Merge pull request #2528 from rhc54/cmr20x/signals > > > > > > Unfortunately it changes nothing. The root rank stops and all other > > > ranks (and mpirun) just stay, the remaining ranks at 100 % CPU waiting > > > apparently in that allreduce. The stack trace looks a bit more > > > interesting (git is always debug build ?), so I include it at the very > > > bottom just in case. > > > > > > Off-list Gilles Gouaillardet suggested to set breakpoints at exit, > > > __exit etc. to try to catch signals. Would that be useful ? I need a > > > moment to figure out how to do this, but I can definitively try. > > > > > > Some remark: During "make install" from the git repo I see a > > > > > > WARNING! 
Common symbols found: > > > mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex > > > mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex > > > mpi-f08-types.o: 0000000000000004 C > > > ompi_f08_mpi_2double_precision > > > mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer > > > mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real > > > mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint > > > mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band > > > mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor > > > mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor > > > mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte > > > > > > I have never noticed this before. > > > > > > > > > Best Regards > > > > > > Christof > > > > > > Thread 1 (Thread 0x2af84cde4840 (LWP 11219)): > > > #0 0x00002af84e4c669d in poll () from /lib64/libc.so.6 > > > #1 0x00002af850517496 in poll_dispatch () from > > > /cluster/mpi/openmpi/2.0.2/ > > > intel2016/lib/libopen-pal.so.20 > > > #2 0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from > > > /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20 > > > #3 0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207 > > > #4 0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144, > > > requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80 > > > #5 0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling > > > (sbuf=0xdecbae0, > > > rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1, > > > module=0xdee69e0) at base/coll_base_allreduce.c:225 > > > #6 0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed > > > (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, > > > comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66 > > > #7 0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2, > > > count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at > > > pallreduce.c:107 > > > #8 
0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005", > > > recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0, > > > datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at > > > pallreduce_f.c:87 > > > #9 0x000000000045ecc6 in m_sum_i_ () > > > #10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ () > > > #11 0x00000000004325ff in vamp () at main.F:2640 > > > #12 0x000000000040de1e in main () > > > #13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6 > > > #14 0x000000000040dd29 in _start () > > > > > > On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org <javascript:;> > > > wrote: > > > > Hi Christof > > > > > > > > Sorry if I missed this, but it sounds like you are saying that one of > > > your procs abnormally terminates, and we are failing to kill the remaining > > > job? Is that correct? > > > > > > > > If so, I just did some work that might relate to that problem that is > > > pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528 < > > > https://github.com/open-mpi/ompi/pull/2528> > > > > > > > > Would you be able to try that? > > > > > > > > Ralph > > > > > > > > > On Dec 7, 2016, at 9:37 AM, Christof Koehler < > > > christof.koeh...@bccms.uni-bremen.de <javascript:;>> wrote: > > > > > > > > > > Hello, > > > > > > > > > > On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote: > > > > >>> On Dec 7, 2016, at 10:07 AM, Christof Koehler < > > > christof.koeh...@bccms.uni-bremen.de <javascript:;>> wrote: > > > > >>>> > > > > >>> I really think the hang is a consequence of > > > > >>> unclean termination (in the sense that the non-root ranks are not > > > > >>> terminated) and probably not the cause, in my interpretation of what > > > I > > > > >>> see. Would you have any suggestion to catch signals sent between > > > orterun > > > > >>> (mpirun) and the child tasks ? > > > > >> > > > > >> Do you know where in the code the termination call is? 
> > > > >> Is it actually calling mpi_abort(), or just doing something ugly like calling
> > > > >> fortran “stop”? If the latter, would that explain a possible hang?
> > > > > Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The
> > > > > wannier90 input contains an error, a restart is requested and the
> > > > > wannier90.chk file with the restart information is missing.
> > > > > "
> > > > > Exiting.......
> > > > > Error: restart requested but wannier90.chk file not found
> > > > > "
> > > > > So it must terminate.
> > > > >
> > > > > The termination happens in the libwannier.a, source file io.F90:
> > > > >
> > > > > write(stdout,*) 'Exiting.......'
> > > > > write(stdout, '(1x,a)') trim(error_msg)
> > > > > close(stdout)
> > > > > stop "wannier90 error: examine the output/error file for details"
> > > > >
> > > > > So it calls stop as you assumed.
> > > > >
> > > > >> Presumably someone here can comment on what the standard says about
> > > > >> the validity of terminating without mpi_abort.
> > > > >
> > > > > Well, probably stop is not a good way to terminate then.
> > > > >
> > > > > My main point was the change relative to 1.10 anyway :-)
> > > > >
> > > > >
> > > > >>
> > > > >> Actually, if you’re willing to share enough input files to reproduce,
> > > > >> I could take a look. I just recompiled our VASP with openmpi 2.0.1 to fix
> > > > >> a crash that was apparently addressed by some change in the memory
> > > > >> allocator in a recent version of openmpi. Just e-mail me if that’s the
> > > > >> case.
> > > > >
> > > > > I think that is no longer necessary ? In principle it is no problem
> > > > > but it is at the end of a (small) GW calculation, the Si tutorial example.
> > > > > So the mail would be a bit larger due to the WAVECAR.
> > > > >
> > > > >
> > > > >>
> > > > >> Noam
> > > > >>
> > > > >>
> > > > >> ____________
> > > > >> ||
> > > > >> |U.S. NAVAL|
> > > > >> |_RESEARCH_|
> > > > >> LABORATORY
> > > > >> Noam Bernstein, Ph.D.
> > > > >> Center for Materials Physics and Technology > > > > >> U.S. Naval Research Laboratory > > > > >> T +1 202 404 8628 F +1 202 404 7546 > > > > >> https://www.nrl.navy.mil <https://www.nrl.navy.mil/> > > > > > > > > > > -- > > > > > Dr. rer. nat. Christof Köhler email: > > > c.koeh...@bccms.uni-bremen.de <javascript:;> > > > > > Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334 > > > > > Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770 > > > > > 28359 Bremen > > > > > > > > > > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/ > > > > > _______________________________________________ > > > > > users mailing list > > > > > users@lists.open-mpi.org <javascript:;> > > > > > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > > > > > > > > > > -- > > > Dr. rer. nat. Christof Köhler email: c.koeh...@bccms.uni-bremen.de > > > <javascript:;> > > > Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334 > > > Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770 > > > 28359 Bremen > > > > > > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/ > > > > > -- > Dr. rer. nat. Christof Köhler email: c.koeh...@bccms.uni-bremen.de > Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334 > Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770 > 28359 Bremen > > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/ -- Dr. rer. nat. Christof Köhler email: c.koeh...@bccms.uni-bremen.de Universitaet Bremen/ BCCMS phone: +49-(0)421-218-62334 Am Fallturm 1/ TAB/ Raum 3.12 fax: +49-(0)421-218-62770 28359 Bremen PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
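
For illustration only: the SIGCHLD catchpoint seen in the gdb session attached to mpirun is the kernel telling a parent that one of its children has changed state. The sketch below is a standalone C example and is not Open MPI's code; the names run_child_and_wait, on_sigchld and wake_fd are hypothetical. It shows a parent that sits in poll(), is woken by SIGCHLD through a self-pipe, and then reaps the child with waitpid(). Getting the signal is only the notification; whether the remaining children are then killed is a separate decision the launcher has to make.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of the self-pipe; only touched by the handler and the helper. */
static volatile sig_atomic_t wake_fd = -1;

/* Async-signal-safe handler: pokes the pipe so that poll() wakes up. */
static void on_sigchld(int signo)
{
    (void)signo;
    int saved_errno = errno;
    int fd = wake_fd;
    if (fd >= 0) {
        char byte = 'c';
        ssize_t n = write(fd, &byte, 1);
        (void)n;
    }
    errno = saved_errno;
}

/*
 * Hypothetical helper (an assumption for this sketch, not an Open MPI API):
 * fork a child that exits with child_exit_code, block in poll() the way an
 * event loop would, and reap the child once SIGCHLD has arrived.
 * Returns the child's exit status, or -1 on error or timeout.
 */
int run_child_and_wait(int child_exit_code)
{
    int fds[2];
    struct sigaction sa, old_sa;
    int result = -1;

    if (pipe(fds) != 0)
        return -1;
    wake_fd = fds[1];

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, &old_sa) != 0) {
        wake_fd = -1;
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0)
        _exit(child_exit_code);   /* child: leave immediately, like a rank that stops */

    if (pid > 0) {
        struct pollfd pfd = { .fd = fds[0], .events = POLLIN, .revents = 0 };
        int rc;
        do {
            rc = poll(&pfd, 1, 5000);          /* 5 s safety timeout */
        } while (rc < 0 && errno == EINTR);    /* the signal itself interrupts poll */

        if (rc > 0) {
            int status = 0;
            /* The notification alone does nothing; the parent must act on it. */
            if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
                result = WEXITSTATUS(status);
        }
    }

    /* Restore the previous SIGCHLD disposition before closing the pipe. */
    sigaction(SIGCHLD, &old_sa, NULL);
    wake_fd = -1;
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

Calling run_child_and_wait(0) from a small main() mimics a child that "exited normally"; a real launcher would at this point also have to decide what to do with the ranks that are still running.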