I'm seeing hangs when MPI_Abort is called.  This is with Open MPI 1.10.3, e.g.:

program output:

Testing  -- big dataset test (bigdset)
Proc 3: *** Parallel ERROR ***
    VRFY (sizeof(MPI_Offset)>4) failed at line  479 in ../../testpar/t_mdset.c
aborting MPI processes
Testing  -- big dataset test (bigdset)
Proc 0: *** Parallel ERROR ***
    VRFY (sizeof(MPI_Offset)>4) failed at line  479 in ../../testpar/t_mdset.c
aborting MPI processes
Testing  -- big dataset test (bigdset)
Proc 2: *** Parallel ERROR ***
    VRFY (sizeof(MPI_Offset)>4) failed at line  479 in ../../testpar/t_mdset.c
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Testing  -- big dataset test (bigdset)
Proc 5: *** Parallel ERROR ***
    VRFY (sizeof(MPI_Offset)>4) failed at line  479 in ../../testpar/t_mdset.c
aborting MPI processes
aborting MPI processes
Testing  -- big dataset test (bigdset)
Proc 1: *** Parallel ERROR ***
    VRFY (sizeof(MPI_Offset)>4) failed at line  479 in ../../testpar/t_mdset.c
aborting MPI processes
Testing  -- big dataset test (bigdset)
Proc 4: *** Parallel ERROR ***
    VRFY (sizeof(MPI_Offset)>4) failed at line  479 in ../../testpar/t_mdset.c
aborting MPI processes

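For reference, the failing VRFY() path amounts to something like the sketch
below.  This is my own minimal reconstruction, not the actual t_mdset.c code,
but it shows the pattern in the log: every rank fails the check, prints an
error, and calls MPI_Abort at roughly the same time.

/* Minimal sketch (my reconstruction, not the real HDF5 test code):
 * each rank fails the sizeof(MPI_Offset) check seen in the log above,
 * prints an error, and calls MPI_Abort(). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (sizeof(MPI_Offset) <= 4) {  /* the check that fails in the log */
        fprintf(stderr, "Proc %d: *** Parallel ERROR ***\n", rank);
        fprintf(stderr, "aborting MPI processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}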

strace of the mpiexec process (it appears to be blocked in this poll() call):

poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN},
{fd=14, events=POLLIN}], 4, -1

lsof output for the same mpiexec process (fds 4, 5, 7, and 14 are the ones being polled):

mpiexec 21511 orion    1w      REG        8,3    10547 17696145
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/builddir/build/BUILD/hdf5-1.8.17/openmpi/testpar/testphdf5.chklog
mpiexec 21511 orion    2w      REG        8,3    10547 17696145
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/builddir/build/BUILD/hdf5-1.8.17/openmpi/testpar/testphdf5.chklog
mpiexec 21511 orion    3u     unix 0xdaedbc80      0t0  4818918 type=STREAM
mpiexec 21511 orion    4u     unix 0xdaed8000      0t0  4818919 type=STREAM
mpiexec 21511 orion    5u  a_inode       0,11        0     8731 [eventfd]
mpiexec 21511 orion    6u      REG       0,38        0  4818921
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/dev/shm/open_mpi.0000
(deleted)
mpiexec 21511 orion    7r     FIFO       0,10      0t0  4818922 pipe
mpiexec 21511 orion    8w     FIFO       0,10      0t0  4818922 pipe
mpiexec 21511 orion    9r      DIR        8,3     4096 15471703
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root
mpiexec 21511 orion   10r      DIR       0,16        0       82
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/sys/firmware/devicetree/base/cpus
mpiexec 21511 orion   11u     IPv4    4818926      0t0      TCP *:39619 (LISTEN)
mpiexec 21511 orion   12r     FIFO       0,10      0t0  4818927 pipe
mpiexec 21511 orion   13w     FIFO       0,10      0t0  4818927 pipe
mpiexec 21511 orion   14r     FIFO        8,3      0t0 17965730
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/tmp/openmpi-sessions-mockbuild@arm03-packager00_0/46622/0/debugger_attach_fifo

Any suggestions on what to look for?  FWIW, it was a 6-process run on a 4-core
machine.

Thanks.

-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       or...@nwra.com
Boulder, CO 80301                   http://www.nwra.com
