I'm seeing hangs when MPI_Abort is called. This is with Open MPI 1.10.3, e.g.:

Program output:
Testing -- big dataset test (bigdset)
Proc 3: *** Parallel ERROR ***
VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
aborting MPI processes
Testing -- big dataset test (bigdset)
Proc 0: *** Parallel ERROR ***
VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
aborting MPI processes
Testing -- big dataset test (bigdset)
Proc 2: *** Parallel ERROR ***
VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Testing -- big dataset test (bigdset)
Proc 5: *** Parallel ERROR ***
VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
aborting MPI processes
aborting MPI processes
Testing -- big dataset test (bigdset)
Proc 1: *** Parallel ERROR ***
VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
aborting MPI processes
Testing -- big dataset test (bigdset)
Proc 4: *** Parallel ERROR ***
VRFY (sizeof(MPI_Offset)>4) failed at line 479 in ../../testpar/t_mdset.c
aborting MPI processes
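For reference, the failing check boils down to the pattern below. This is only
a minimal sketch of what I believe the test is doing, not the actual
t_mdset.c code: every rank trips the same sizeof(MPI_Offset) assertion and
independently calls MPI_Abort, so several near-simultaneous aborts hit
mpiexec at roughly the same time.

/* Minimal sketch (not the real t_mdset.c) of the abort pattern. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* On this 32-bit armhfp build, MPI_Offset is apparently only
     * 4 bytes, so this check fails on every rank. */
    if (sizeof(MPI_Offset) <= 4) {
        fprintf(stderr, "Proc %d: *** Parallel ERROR ***\n", rank);
        fprintf(stderr, "aborting MPI processes\n");
        /* Each rank calls MPI_Abort on its own, unsynchronized. */
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}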
strace of the mpiexec process shows it blocked in poll():

poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN},
{fd=14, events=POLLIN}], 4, -1

lsof output for the mpiexec process:

mpiexec 21511 orion 1w REG 8,3 10547 17696145
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/builddir/build/BUILD/hdf5-1.8.17/openmpi/testpar/testphdf5.chklog
mpiexec 21511 orion 2w REG 8,3 10547 17696145
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/builddir/build/BUILD/hdf5-1.8.17/openmpi/testpar/testphdf5.chklog
mpiexec 21511 orion 3u unix 0xdaedbc80 0t0 4818918 type=STREAM
mpiexec 21511 orion 4u unix 0xdaed8000 0t0 4818919 type=STREAM
mpiexec 21511 orion 5u a_inode 0,11 0 8731 [eventfd]
mpiexec 21511 orion 6u REG 0,38 0 4818921
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/dev/shm/open_mpi.0000
(deleted)
mpiexec 21511 orion 7r FIFO 0,10 0t0 4818922 pipe
mpiexec 21511 orion 8w FIFO 0,10 0t0 4818922 pipe
mpiexec 21511 orion 9r DIR 8,3 4096 15471703
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root
mpiexec 21511 orion 10r DIR 0,16 0 82
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/sys/firmware/devicetree/base/cpus
mpiexec 21511 orion 11u IPv4 4818926 0t0 TCP *:39619 (LISTEN)
mpiexec 21511 orion 12r FIFO 0,10 0t0 4818927 pipe
mpiexec 21511 orion 13w FIFO 0,10 0t0 4818927 pipe
mpiexec 21511 orion 14r FIFO 8,3 0t0 17965730
/var/lib/mock/fedora-rawhide-armhfp--orion-hdf5/root/tmp/openmpi-sessions-mockbuild@arm03-packager00_0/46622/0/debugger_attach_fifo
Any suggestions on what to look for? FWIW, it was a 6-process run on a 4-core
machine.
Thanks.
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com