Hello again,

I am still not sure about breakpoints, but I did a "catch signal" in
gdb, with one gdb attached to each of the two vasp processes and one
attached to mpirun.

When the root rank exits, I see this in the gdb attached to it:
[Thread 0x2b2787df8700 (LWP 2457) exited]
[Thread 0x2b277f483180 (LWP 2455) exited]
[Inferior 1 (process 2455) exited normally]

In the gdb attached to mpirun:
Catchpoint 1 (signal SIGCHLD), 0x00002b16560f769d in poll () from /lib64/libc.so.6

In the gdb attached to the second rank I see no output.

Issuing "continue" in the gdb session attached to mpirun does not lead
to anything new as far as I can tell.

The stack trace of mpirun after that (Ctrl-C'ed to stop it again) is:
#0  0x00002b16560f769d in poll () from /lib64/libc.so.6
#1  0x00002b1654b3a496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2  0x00002b1654b32fa5 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3  0x0000000000406311 in orterun (argc=7, argv=0x7ffdabfbebc8) at orterun.c:1071
#4  0x00000000004037e0 in main (argc=7, argv=0x7ffdabfbebc8) at main.c:13

So there is a signal (SIGCHLD), but mpirun apparently does nothing with it?
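
For context: a SIGCHLD only tells the parent that a child changed state.
The parent still has to reap the child (for example with waitpid) to
learn that it exited and with which status. Below is a minimal Python
sketch of that generic Unix pattern. It is not Open MPI code; the
function name wait_for_child_exit and the self-pipe wakeup are
assumptions chosen purely for illustration. The parent blocks in poll(),
is woken by SIGCHLD, and then reaps the child.

```python
import os
import select
import signal


def wait_for_child_exit(exit_code=0, timeout_s=5.0):
    """Fork a child that exits at once; report what the parent observes.

    Returns (saw_sigchld, exited_normally, status_code).
    Hypothetical helper for illustration only, not part of any MPI library.
    """
    # Self-pipe: Python writes the number of each delivered signal to wake_w,
    # so a poll() on wake_r returns when SIGCHLD arrives.
    wake_r, wake_w = os.pipe()
    os.set_blocking(wake_w, False)  # set_wakeup_fd needs a non-blocking fd
    old_handler = signal.signal(signal.SIGCHLD, lambda signum, frame: None)
    old_fd = signal.set_wakeup_fd(wake_w)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(exit_code)  # child: exit immediately, no cleanup

        poller = select.poll()
        poller.register(wake_r, select.POLLIN)
        if not poller.poll(int(timeout_s * 1000)):
            raise TimeoutError("no signal arrived before the timeout")

        # Drain the pipe; each byte is the number of a delivered signal.
        delivered = os.read(wake_r, 64)
        saw_sigchld = int(signal.SIGCHLD) in delivered

        # SIGCHLD alone carries no exit status: the parent must reap the child.
        _, status = os.waitpid(pid, 0)
        return saw_sigchld, os.WIFEXITED(status), os.WEXITSTATUS(status)
    finally:
        # Restore the previous signal setup and release the pipe.
        signal.set_wakeup_fd(old_fd)
        if old_handler is not None:
            signal.signal(signal.SIGCHLD, old_handler)
        os.close(wake_r)
        os.close(wake_w)


if __name__ == "__main__":
    print(wait_for_child_exit(exit_code=0))
```

In this pattern, a parent that is woken by SIGCHLD but never acts on the
child's exit leaves the other children running, which would look like
the symptom described here. Whether that is what actually happens inside
mpirun is not something this sketch can show.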

Cheers

Christof


On Thu, Dec 08, 2016 at 12:39:06PM +0100, Christof Koehler wrote:
> Hello,
> 
> On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
> > Christof,
> > 
> > 
> > There is something really odd with this stack trace.
> > count is zero, and some pointers do not point to valid addresses (!)
> Yes, I assumed it was interesting :-) Note that the program is compiled
> with -O2 -fp-model source, so optimization is on. I can try with -O0
> or with gcc/gfortran (will take a moment) to make sure the optimization
> is not the cause.
> 
> > 
> > in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> > the stack has been corrupted inside MPI_Allreduce(), or that you are not
> > using the library you think you use
> > pmap <pid> will show you which lib is used
> The pmap of the survivor is at the very end of this mail.
> 
> > 
> > btw, this was not started with
> > mpirun --mca coll ^tuned ...
> > right ?
> This is correct, not started with "mpirun --mca coll ^tuned". Using it
> does not change anything.
> 
> > 
> > just to make it clear ...
> > a task from your program bluntly issues a fortran STOP, and this is kind of
> > a feature.
> Yes. The library where the stop occurs is/was written for serial use as
> far as I can tell. As I mentioned, it is not our code but this one
> http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/
> which should be a working combination.
> 
> > the *only* issue is mpirun does not kill the other MPI tasks and mpirun
> > never completes.
> > did i get it right ?
> Yes! So it is not a really big problem IMO. Just a bit nasty if this
> happened with a job in the queueing system.
> 
> Best Regards
> 
> Christof
> 
> Note: git branch 2.0.2 of openmpi was configured and installed (make
> install) with
> ./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
> precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
> FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
> --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
> --prefix=/cluster/mpi/openmpi/2.0.2/intel2016
> 
> The OS is Centos 7, relatively current :-) with current Omni-Path driver
> package from Intel (10.2).
> 
> vasp is linked against Intel MKL Lapack/Blas, self-compiled scalapack
> (trunk 206) and FFTW 3.3.5. FFTW and scalapack are statically linked, and
> of course libwannier.a version 1.2 is statically linked as well.
> 
> pmap -p of the survivor
> 
> 32282:   /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 0000000000400000  65200K r-x-- 
> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 00000000045ab000    100K r---- 
> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 00000000045c4000   2244K rw--- 
> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 00000000047f5000 100900K rw---   [ anon ]
> 000000000bfaa000    684K rw---   [ anon ]
> 000000000c055000     20K rw---   [ anon ]
> 000000000c05a000    424K rw---   [ anon ]
> 000000000c0c4000     68K rw---   [ anon ]
> 000000000c0d5000  25384K rw---   [ anon ]
> 00002b17e34f6000    132K r-x-- /usr/lib64/ld-2.17.so
> 00002b17e3517000      4K rw---   [ anon ]
> 00002b17e3518000     28K rw-s- /dev/infiniband/uverbs0
> 00002b17e3523000     88K rw---   [ anon ]
> 00002b17e3539000    772K rw-s- /dev/infiniband/uverbs0
> 00002b17e35fa000    772K rw-s- /dev/infiniband/uverbs0
> 00002b17e36bb000    196K rw-s- /dev/infiniband/uverbs0
> 00002b17e36ec000     28K rw-s- /dev/infiniband/uverbs0
> 00002b17e36f3000     20K rw-s- /dev/infiniband/uverbs0
> 00002b17e3717000      4K r---- /usr/lib64/ld-2.17.so
> 00002b17e3718000      4K rw--- /usr/lib64/ld-2.17.so
> 00002b17e3719000      4K rw---   [ anon ]
> 00002b17e371a000     88K r-x-- /usr/lib64/libpthread-2.17.so
> 00002b17e3730000   2048K ----- /usr/lib64/libpthread-2.17.so
> 00002b17e3930000      4K r---- /usr/lib64/libpthread-2.17.so
> 00002b17e3931000      4K rw--- /usr/lib64/libpthread-2.17.so
> 00002b17e3932000     16K rw---   [ anon ]
> 00002b17e3936000   1028K r-x-- /usr/lib64/libm-2.17.so
> 00002b17e3a37000   2044K ----- /usr/lib64/libm-2.17.so
> 00002b17e3c36000      4K r---- /usr/lib64/libm-2.17.so
> 00002b17e3c37000      4K rw--- /usr/lib64/libm-2.17.so
> 00002b17e3c38000     12K r-x-- /usr/lib64/libdl-2.17.so
> 00002b17e3c3b000   2044K ----- /usr/lib64/libdl-2.17.so
> 00002b17e3e3a000      4K r---- /usr/lib64/libdl-2.17.so
> 00002b17e3e3b000      4K rw--- /usr/lib64/libdl-2.17.so
> 00002b17e3e3c000    184K r-x-- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
> 00002b17e3e6a000   2044K ----- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
> 00002b17e4069000      4K r---- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
> 00002b17e406a000      4K rw--- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
> 00002b17e406b000     36K r-x-- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
> 00002b17e4074000   2044K ----- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
> 00002b17e4273000      4K r---- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
> 00002b17e4274000      4K rw--- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
> 00002b17e4275000    396K r-x-- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
> 00002b17e42d8000   2044K ----- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
> 00002b17e44d7000      4K r---- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
> 00002b17e44d8000      4K rw--- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_mpifh.so.20.0.0
> 00002b17e44d9000   1948K r-x-- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
> 00002b17e46c0000   2044K ----- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
> 00002b17e48bf000     12K r---- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
> 00002b17e48c2000    104K rw--- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi.so.20.0.1
> 00002b17e48dc000     76K rw---   [ anon ]
> 00002b17e48ef000    948K r-x-- /usr/lib64/libc-2.17.so
> 00002b17e49dc000      4K r-x-- /usr/lib64/libc-2.17.so
> 00002b17e49dd000     12K r-x-- /usr/lib64/libc-2.17.so
> 00002b17e49e0000      4K r-x-- /usr/lib64/libc-2.17.so
> 00002b17e49e1000     20K r-x-- /usr/lib64/libc-2.17.so
> 00002b17e49e6000      8K r-x-- /usr/lib64/libc-2.17.so
> 00002b17e49e8000    760K r-x-- /usr/lib64/libc-2.17.so
> 00002b17e4aa6000   2048K ----- /usr/lib64/libc-2.17.so
> 00002b17e4ca6000     16K r---- /usr/lib64/libc-2.17.so
> 00002b17e4caa000      8K rw--- /usr/lib64/libc-2.17.so
> 00002b17e4cac000     20K rw---   [ anon ]
> 00002b17e4cb1000     84K r-x-- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
> 00002b17e4cc6000   2044K ----- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
> 00002b17e4ec5000      4K r---- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
> 00002b17e4ec6000      4K rw--- /usr/lib64/libgcc_s-4.8.5-20150702.so.1
> 00002b17e4ec7000    452K r-x-- /usr/lib64/libpsm2.so.2.1
> 00002b17e4f38000   2044K ----- /usr/lib64/libpsm2.so.2.1
> 00002b17e5137000      4K r---- /usr/lib64/libpsm2.so.2.1
> 00002b17e5138000      8K rw--- /usr/lib64/libpsm2.so.2.1
> 00002b17e513a000      4K rw---   [ anon ]
> 00002b17e513b000   1344K r-x-- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
> 00002b17e528b000   2044K ----- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
> 00002b17e548a000      8K r---- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
> 00002b17e548c000     44K rw--- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-rte.so.20.0.0
> 00002b17e5497000     12K rw---   [ anon ]
> 00002b17e549a000    480K r-x-- /usr/lib64/libtorque.so.2.0.0
> 00002b17e5512000   2044K ----- /usr/lib64/libtorque.so.2.0.0
> 00002b17e5711000      8K r---- /usr/lib64/libtorque.so.2.0.0
> 00002b17e5713000      8K rw--- /usr/lib64/libtorque.so.2.0.0
> 00002b17e5715000   6704K rw---   [ anon ]
> 00002b17e5da1000   1404K r-x-- /usr/lib64/libxml2.so.2.9.1
> 00002b17e5f00000   2044K ----- /usr/lib64/libxml2.so.2.9.1
> 00002b17e60ff000     32K r---- /usr/lib64/libxml2.so.2.9.1
> 00002b17e6107000      8K rw--- /usr/lib64/libxml2.so.2.9.1
> 00002b17e6109000      8K rw---   [ anon ]
> 00002b17e610b000     84K r-x-- /usr/lib64/libz.so.1.2.7
> 00002b17e6120000   2044K ----- /usr/lib64/libz.so.1.2.7
> 00002b17e631f000      4K r---- /usr/lib64/libz.so.1.2.7
> 00002b17e6320000      4K rw--- /usr/lib64/libz.so.1.2.7
> 00002b17e6321000   1784K r-x-- /usr/lib64/libcrypto.so.1.0.1e
> 00002b17e64df000   2048K ----- /usr/lib64/libcrypto.so.1.0.1e
> 00002b17e66df000    104K r---- /usr/lib64/libcrypto.so.1.0.1e
> 00002b17e66f9000     48K rw--- /usr/lib64/libcrypto.so.1.0.1e
> 00002b17e6705000     16K rw---   [ anon ]
> 00002b17e6709000    396K r-x-- /usr/lib64/libssl.so.1.0.1e
> 00002b17e676c000   2044K ----- /usr/lib64/libssl.so.1.0.1e
> 00002b17e696b000     16K r---- /usr/lib64/libssl.so.1.0.1e
> 00002b17e696f000     28K rw--- /usr/lib64/libssl.so.1.0.1e
> 00002b17e6976000   1572K r-x-- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
> 00002b17e6aff000   2044K ----- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
> 00002b17e6cfe000     20K r---- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
> 00002b17e6d03000     56K rw--- 
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20.1.0
> 00002b17e6d11000    552K rw---   [ anon ]
> 00002b17e6d9b000     84K r-x-- /usr/lib64/librdmacm.so.1.0.0
> 00002b17e6db0000   2044K ----- /usr/lib64/librdmacm.so.1.0.0
> 00002b17e6faf000      4K r---- /usr/lib64/librdmacm.so.1.0.0
> 00002b17e6fb0000      4K rw--- /usr/lib64/librdmacm.so.1.0.0
> 00002b17e6fb1000      4K rw---   [ anon ]
> 00002b17e6fb2000     68K r-x-- /usr/lib64/libibverbs.so.1.0.0
> 00002b17e6fc3000   2044K ----- /usr/lib64/libibverbs.so.1.0.0
> 00002b17e71c2000      4K r---- /usr/lib64/libibverbs.so.1.0.0
> 00002b17e71c3000      4K rw--- /usr/lib64/libibverbs.so.1.0.0
> 00002b17e71c4000     40K r-x-- /usr/lib64/libnuma.so.1
> 00002b17e71ce000   2048K ----- /usr/lib64/libnuma.so.1
> 00002b17e73ce000      4K r---- /usr/lib64/libnuma.so.1
> 00002b17e73cf000      4K rw--- /usr/lib64/libnuma.so.1
> 00002b17e73d0000     32K r-x-- /usr/lib64/libpciaccess.so.0.11.1
> 00002b17e73d8000   2048K ----- /usr/lib64/libpciaccess.so.0.11.1
> 00002b17e75d8000      4K r---- /usr/lib64/libpciaccess.so.0.11.1
> 00002b17e75d9000      4K rw--- /usr/lib64/libpciaccess.so.0.11.1
> 00002b17e75da000     28K r-x-- /usr/lib64/librt-2.17.so
> 00002b17e75e1000   2044K ----- /usr/lib64/librt-2.17.so
> 00002b17e77e0000      4K r---- /usr/lib64/librt-2.17.so
> 00002b17e77e1000      4K rw--- /usr/lib64/librt-2.17.so
> 00002b17e77e2000      8K r-x-- /usr/lib64/libutil-2.17.so
> 00002b17e77e4000   2044K ----- /usr/lib64/libutil-2.17.so
> 00002b17e79e3000      4K r---- /usr/lib64/libutil-2.17.so
> 00002b17e79e4000      4K rw--- /usr/lib64/libutil-2.17.so
> 00002b17e79e5000    152K r-x-- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
> 00002b17e7a0b000   2044K ----- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
> 00002b17e7c0a000      4K r---- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
> 00002b17e7c0b000      8K rw--- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifport.so.5
> 00002b17e7c0d000     24K rw---   [ anon ]
> 00002b17e7c13000   1288K r-x-- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
> 00002b17e7d55000   2044K ----- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
> 00002b17e7f54000     12K r---- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
> 00002b17e7f57000     12K rw--- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcore.so.5
> 00002b17e7f5a000    116K rw---   [ anon ]
> 00002b17e7f77000   2696K r-x-- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
> 00002b17e8219000   2044K ----- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
> 00002b17e8418000     24K r---- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
> 00002b17e841e000    340K rw--- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libimf.so
> 00002b17e8473000    420K r-x-- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
> 00002b17e84dc000   2048K ----- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
> 00002b17e86dc000      4K r---- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
> 00002b17e86dd000      4K rw--- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libintlc.so.5
> 00002b17e86de000      4K rw---   [ anon ]
> 00002b17e86df000  13124K r-x-- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
> 00002b17e93b0000   2048K ----- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
> 00002b17e95b0000    220K r---- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
> 00002b17e95e7000     20K rw--- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libsvml.so
> 00002b17e95ec000   1304K r-x-- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
> 00002b17e9732000   2048K ----- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
> 00002b17e9932000     12K r---- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
> 00002b17e9935000     12K rw--- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libifcoremt.so.5
> 00002b17e9938000    296K rw---   [ anon ]
> 00002b17e9982000   1464K r-x-- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
> 00002b17e9af0000   2044K ----- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
> 00002b17e9cef000      4K r---- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
> 00002b17e9cf0000     16K rw--- 
> /cluster/intel/compilers_and_libraries_2016.4.258/linux/compiler/lib/intel64_lin/libirng.so
> 00002b17e9cf4000     16K r-x-- /usr/lib64/libuuid.so.1.3.0
> 00002b17e9cf8000   2044K ----- /usr/lib64/libuuid.so.1.3.0
> 00002b17e9ef7000      4K r---- /usr/lib64/libuuid.so.1.3.0
> 00002b17e9ef8000      4K rw--- /usr/lib64/libuuid.so.1.3.0
> 00002b17e9ef9000    932K r-x-- /usr/lib64/libstdc++.so.6.0.19
> 00002b17e9fe2000   2048K ----- /usr/lib64/libstdc++.so.6.0.19
> 00002b17ea1e2000     32K r---- /usr/lib64/libstdc++.so.6.0.19
> 00002b17ea1ea000      8K rw--- /usr/lib64/libstdc++.so.6.0.19
> 00002b17ea1ec000     84K rw---   [ anon ]
> 00002b17ea201000    144K r-x-- /usr/lib64/liblzma.so.5.0.99
> 00002b17ea225000   2044K ----- /usr/lib64/liblzma.so.5.0.99
> 00002b17ea424000      4K r---- /usr/lib64/liblzma.so.5.0.99
> 00002b17ea425000      4K rw--- /usr/lib64/liblzma.so.5.0.99
> 00002b17ea426000    292K r-x-- /usr/lib64/libgssapi_krb5.so.2.2
> 00002b17ea46f000   2048K ----- /usr/lib64/libgssapi_krb5.so.2.2
> 00002b17ea66f000      4K r---- /usr/lib64/libgssapi_krb5.so.2.2
> 00002b17ea670000      8K rw--- /usr/lib64/libgssapi_krb5.so.2.2
> 00002b17ea672000    852K r-x-- /usr/lib64/libkrb5.so.3.3
> 00002b17ea747000   2048K ----- /usr/lib64/libkrb5.so.3.3
> 00002b17ea947000     52K r---- /usr/lib64/libkrb5.so.3.3
> 00002b17ea954000     12K rw--- /usr/lib64/libkrb5.so.3.3
> 00002b17ea957000     12K r-x-- /usr/lib64/libcom_err.so.2.1
> 00002b17ea95a000   2044K ----- /usr/lib64/libcom_err.so.2.1
> 00002b17eab59000      4K r---- /usr/lib64/libcom_err.so.2.1
> 00002b17eab5a000      4K rw--- /usr/lib64/libcom_err.so.2.1
> 00002b17eab5b000    188K r-x-- /usr/lib64/libk5crypto.so.3.1
> 00002b17eab8a000   2044K ----- /usr/lib64/libk5crypto.so.3.1
> 00002b17ead89000      8K r---- /usr/lib64/libk5crypto.so.3.1
> 00002b17ead8b000      4K rw--- /usr/lib64/libk5crypto.so.3.1
> 00002b17ead8c000      4K rw---   [ anon ]
> 00002b17ead8d000    284K r-x-- /usr/lib64/libnl-route-3.so.200.16.1
> 00002b17eadd4000   2044K ----- /usr/lib64/libnl-route-3.so.200.16.1
> 00002b17eafd3000     12K r---- /usr/lib64/libnl-route-3.so.200.16.1
> 00002b17eafd6000     16K rw--- /usr/lib64/libnl-route-3.so.200.16.1
> 00002b17eafda000      8K rw---   [ anon ]
> 00002b17eafdc000    104K r-x-- /usr/lib64/libnl-3.so.200.16.1
> 00002b17eaff6000   2044K ----- /usr/lib64/libnl-3.so.200.16.1
> 00002b17eb1f5000      8K r---- /usr/lib64/libnl-3.so.200.16.1
> 00002b17eb1f7000      4K rw--- /usr/lib64/libnl-3.so.200.16.1
> 00002b17eb1f8000     52K r-x-- /usr/lib64/libkrb5support.so.0.1
> 00002b17eb205000   2048K ----- /usr/lib64/libkrb5support.so.0.1
> 00002b17eb405000      4K r---- /usr/lib64/libkrb5support.so.0.1
> 00002b17eb406000      4K rw--- /usr/lib64/libkrb5support.so.0.1
> 00002b17eb407000     12K r-x-- /usr/lib64/libkeyutils.so.1.5
> 00002b17eb40a000   2044K ----- /usr/lib64/libkeyutils.so.1.5
> 00002b17eb609000      4K r---- /usr/lib64/libkeyutils.so.1.5
> 00002b17eb60a000      4K rw--- /usr/lib64/libkeyutils.so.1.5
> 00002b17eb60b000     88K r-x-- /usr/lib64/libresolv-2.17.so
> 00002b17eb621000   2048K ----- /usr/lib64/libresolv-2.17.so
> 00002b17eb821000      4K r---- /usr/lib64/libresolv-2.17.so
> 00002b17eb822000      4K rw--- /usr/lib64/libresolv-2.17.so
> 00002b17eb823000      8K rw---   [ anon ]
> 00002b17eb825000    132K r-x-- /usr/lib64/libselinux.so.1
> 00002b17eb846000   2048K ----- /usr/lib64/libselinux.so.1
> 00002b17eba46000      4K r---- /usr/lib64/libselinux.so.1
> 00002b17eba47000      4K rw--- /usr/lib64/libselinux.so.1
> 00002b17eba48000      8K rw---   [ anon ]
> 00002b17eba4a000    384K r-x-- /usr/lib64/libpcre.so.1.2.0
> 00002b17ebaaa000   2044K ----- /usr/lib64/libpcre.so.1.2.0
> 00002b17ebca9000      4K r---- /usr/lib64/libpcre.so.1.2.0
> 00002b17ebcaa000      4K rw--- /usr/lib64/libpcre.so.1.2.0
> 00002b17ebcab000      4K -----   [ anon ]
> 00002b17ebcac000   3352K rw---   [ anon ]
> 00002b17ec000000    132K rw---   [ anon ]
> 00002b17ec021000  65404K -----   [ anon ]
> 00002b17f0000000      4K -----   [ anon ]
> 00002b17f0001000   2048K rw---   [ anon ]
> 00002b17f0201000     16K r-x-- /usr/lib64/libhfi1verbs-rdmav2.so
> 00002b17f0205000   2044K ----- /usr/lib64/libhfi1verbs-rdmav2.so
> 00002b17f0404000      4K r---- /usr/lib64/libhfi1verbs-rdmav2.so
> 00002b17f0405000      4K rw--- /usr/lib64/libhfi1verbs-rdmav2.so
> 00002b17f0406000      4K rw---   [ anon ]
> 00002b17f0407000   4096K rw---   [ anon ]
> 00002b17f0807000   1032K rw---   [ anon ]
> 00002b17f0909000   4100K rw-s- 
> /tmp/openmpi-sessions-12001@node109_0/52426/1/1/vader_segment.node109.1
> 00002b17f0d0a000   4236K rw-s- /dev/shm/psm2_shm.1200100000001a17100200
> 00002b17f112d000    132K rw---   [ anon ]
> 00002b17f114e000   4236K rw-s- /dev/shm/psm2_shm.1200100000000a17100000 
> (deleted)
> 00002b17f1571000   8628K rw---   [ anon ]
> 00002b17f4000000    132K rw---   [ anon ]
> 00002b17f4021000  65404K -----   [ anon ]
> 00002b17f9e85000   9164K rw---   [ anon ]
> 00007ffd8b021000  31316K rw---   [ stack ]
> 00007ffd8cfa4000      8K r-x--   [ anon ]
> ffffffffff600000      4K r-x--   [ anon ]
>  total           539352K
> 
> 
> 
> > 
> > Cheers,
> > 
> > Gilles
> > 
> > On Thursday, December 8, 2016, Christof Koehler <
> > christof.koeh...@bccms.uni-bremen.de> wrote:
> > 
> > > Hello everybody,
> > >
> > > I tried it with the nightly and the direct 2.0.2 branch from git which
> > > according to the log should contain that patch
> > >
> > > commit d0b97d7a408b87425ca53523de369da405358ba2
> > > Merge: ac8c019 b9420bb
> > > Author: Jeff Squyres <jsquy...@users.noreply.github.com>
> > > Date:   Wed Dec 7 18:24:46 2016 -0500
> > >     Merge pull request #2528 from rhc54/cmr20x/signals
> > >
> > > Unfortunately it changes nothing. The root rank stops and all other
> > > ranks (and mpirun) just stay, the remaining ranks at 100 % CPU waiting
> > > apparently in that allreduce. The stack trace looks a bit more
> > > interesting (git is always debug build ?), so I include it at the very
> > > bottom just in case.
> > >
> > > Off-list, Gilles Gouaillardet suggested setting breakpoints at exit,
> > > __exit etc. to try to catch signals. Would that be useful? I need a
> > > moment to figure out how to do this, but I can definitely try.
> > >
> > > One remark: during "make install" from the git repo I see a
> > >
> > > WARNING!  Common symbols found:
> > >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
> > >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
> > >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_precision
> > >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
> > >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
> > >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
> > >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
> > >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
> > >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
> > >           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte
> > >
> > > I have never noticed this before.
> > >
> > >
> > > Best Regards
> > >
> > > Christof
> > >
> > > Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
> > > #0  0x00002af84e4c669d in poll () from /lib64/libc.so.6
> > > #1  0x00002af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> > > #2  0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> > > #3  0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
> > > #4  0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144,
> > > requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
> > > #5  0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling
> > > (sbuf=0xdecbae0,
> > > rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1,
> > > module=0xdee69e0) at base/coll_base_allreduce.c:225
> > > #6  0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed
> > > (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0,
> > > comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
> > > #7  0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2,
> > > count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at 
> > > pallreduce.c:107
> > > #8  0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005",
> > > recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0,
> > > datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at
> > > pallreduce_f.c:87
> > > #9  0x000000000045ecc6 in m_sum_i_ ()
> > > #10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
> > > #11 0x00000000004325ff in vamp () at main.F:2640
> > > #12 0x000000000040de1e in main ()
> > > #13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
> > > #14 0x000000000040dd29 in _start ()
> > >
> > > On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org wrote:
> > > > Hi Christof
> > > >
> > > > Sorry if I missed this, but it sounds like you are saying that one of
> > > > your procs abnormally terminates, and we are failing to kill the remaining
> > > > job? Is that correct?
> > > >
> > > > If so, I just did some work that might relate to that problem that is
> > > > pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528
> > > >
> > > > Would you be able to try that?
> > > >
> > > > Ralph
> > > >
> > > > > On Dec 7, 2016, at 9:37 AM, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> > > > >>> On Dec 7, 2016, at 10:07 AM, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
> > > > >>>>
> > > > >>> I really think the hang is a consequence of
> > > > >>> unclean termination (in the sense that the non-root ranks are not
> > > > >>> terminated) and probably not the cause, in my interpretation of what I
> > > > >>> see. Would you have any suggestion to catch signals sent between orterun
> > > > >>> (mpirun) and the child tasks ?
> > > > >>
> > > > >> Do you know where in the code the termination call is?  Is it
> > > > >> actually calling mpi_abort(), or just doing something ugly like calling
> > > > >> fortran “stop”?  If the latter, would that explain a possible hang?
> > > > > Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The
> > > > > wannier90 input contains an error: a restart is requested and the
> > > > > wannier90.chk file with the restart information is missing.
> > > > > "
> > > > > Exiting.......
> > > > > Error: restart requested but wannier90.chk file not found
> > > > > "
> > > > > So it must terminate.
> > > > >
> > > > > The termination happens in the libwannier.a, source file io.F90:
> > > > >
> > > > > write(stdout,*)  'Exiting.......'
> > > > > write(stdout, '(1x,a)') trim(error_msg)
> > > > > close(stdout)
> > > > > stop "wannier90 error: examine the output/error file for details"
> > > > >
> > > > > So it calls stop  as you assumed.
> > > > >
> > > > >> Presumably someone here can comment on what the standard says about
> > > > >> the validity of terminating without mpi_abort.
> > > > >
> > > > > Well, probably stop is not a good way to terminate then.
> > > > >
> > > > > My main point was the change relative to 1.10 anyway :-)
> > > > >
> > > > >
> > > > >>
> > > > >> Actually, if you’re willing to share enough input files to reproduce,
> > > > >> I could take a look.  I just recompiled our VASP with openmpi 2.0.1 to fix
> > > > >> a crash that was apparently addressed by some change in the memory
> > > > >> allocator in a recent version of openmpi.  Just e-mail me if that’s the
> > > > >> case.
> > > > >
> > > > > I think that is no longer necessary ? In principle it is no problem,
> > > > > but it is at the end of a (small) GW calculation, the Si tutorial example.
> > > > > So the mail would be a bit larger due to the WAVECAR.
> > > > >
> > > > >
> > > > >>
> > > > >>
> > > > >> Noam
> > > > >>
> > > > >>
> > > > >> ____________
> > > > >> ||
> > > > >> |U.S. NAVAL|
> > > > >> |_RESEARCH_|
> > > > >> LABORATORY
> > > > >> Noam Bernstein, Ph.D.
> > > > >> Center for Materials Physics and Technology
> > > > >> U.S. Naval Research Laboratory
> > > > >> T +1 202 404 8628  F +1 202 404 7546
> > > > >> https://www.nrl.navy.mil <https://www.nrl.navy.mil/>
> > > > >
> > > > > --
> > > > > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > > > > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> > > > > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> > > > > 28359 Bremen
> > > > >
> > > > > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> > > > > _______________________________________________
> > > > > users mailing list
> > > > > users@lists.open-mpi.org
> > > > > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> > > >
> > >
> > > --
> > > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> > > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> > > 28359 Bremen
> > >
> > > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> > >
> 
> -- 
> Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> 28359 Bremen  
> 
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/



-- 
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen  

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/


_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
