Hi folks,
I've been running into some issues using the MPI_IN_PLACE option with
MPI_Ireduce, in what seems to be a regression from Open MPI v1.10.1.
Specifically, I encounter data corruption when non-root ranks supply the same
pointer for both 'send' and 'recv', even though this should have no bearing on
the results, since those ranks do not expect to receive anything. For example,
take this scenario:
The root rank, i.e. the one receiving the result of the reduction (say rank 0, which is 'sink'), does:
  MPI_Ireduce( MPI_IN_PLACE, data, N, mpiType<T>(), MPI_MIN, sink, MPI_COMM_WORLD, &req );
The ranks sharing data for the reduction (say ranks 1, 2, 3) do:
  MPI_Ireduce( data, recv, N, mpiType<T>(), MPI_MIN, sink, MPI_COMM_WORLD, &req );
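For context, here is a condensed, self-contained sketch of that call pattern
(this is not the exact gist linked below; the buffer size, initial values, and
the use of int32_t in place of the mpiType<T>() helper are illustrative
assumptions):

  #include <mpi.h>
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 10;                    // illustrative element count
    const int sink = 0;                  // root rank that receives the result
    std::vector<int32_t> data(N, rank);  // each rank contributes its rank id

    MPI_Request req;
    if (rank == sink) {
      // Root rank: reduce in place into 'data'.
      MPI_Ireduce(MPI_IN_PLACE, data.data(), N, MPI_INT32_T, MPI_MIN,
                  sink, MPI_COMM_WORLD, &req);
    } else {
      // Non-root ranks: 'recv' aliases 'data'. The recv buffer should only be
      // significant at the root, but this aliasing is the case that produces
      // garbage (or a segfault with -np 2) for me.
      int32_t* recv = data.data();
      MPI_Ireduce(data.data(), recv, N, MPI_INT32_T, MPI_MIN,
                  sink, MPI_COMM_WORLD, &req);
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == sink) {
      for (int i = 0; i < N; ++i) std::printf("%d, ", (int)data[i]);
      std::printf("\n");  // expect all zeros (MPI_MIN over ranks 0..n-1)
    }
    MPI_Finalize();
    return 0;
  }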
As I understand it from the documentation, the value of the 'recv' pointer on
ranks sharing data with the root rank should not be relevant. However, when
'recv' is set equal to the 'data' pointer on non-root ranks, I get
non-deterministic garbage values as the result of the reduction.
This problem goes away if you supply 'nullptr' as the recv pointer on non-root
ranks; it also goes away if you replace MPI_Ireduce with MPI_Reduce.
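In other words, only the non-root call changes in the working variant; a
sketch of that workaround, using the same names as above:

  // Workaround sketch: on non-root ranks, don't alias the send buffer;
  // pass nullptr as the recv argument instead.
  MPI_Ireduce(data.data(), nullptr, N, MPI_INT32_T, MPI_MIN,
              sink, MPI_COMM_WORLD, &req);
  // Alternatively, the blocking MPI_Reduce with the original aliased
  // arguments also completes without corruption.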
Source code to reproduce is here:
https://gist.github.com/akessler/3e911102892f3d6442feeb254f74665f
mpic++ -std=c++11 -Wall -Wextra -o reducebug reducebug.cpp
mpiexec -np 3 ./reducebug
RESULT:
<snip>
Case with T = int32_t
Expect: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
Actual: 0, 0, -649746440, 0, -649746440, 0, 0, 0, 0, 0,
[FAILURE] : Data mismatch at sink

<another run>

Case with T = uint64_t
Expect: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
Actual: 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
[FAILURE] : Data mismatch at sink
You may need to run it a couple of times to see the failure, although it
occurs about 95% of the time.
Furthermore, with just two ranks I get a consistent, immediate segfault,
which suggests that a memory copy is being performed between these two
pointers:
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x90
[ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2ac11c395100]
[ 1] ./reduce(__intel_ssse3_rep_memcpy+0x23fe)[0x41022e]
[ 2] <snip>/openmpi/intel/2015.6.233/2.0.1/lib/libopen-pal.so.20(opal_cuda_memcpy+0x7d)[0x2ac11ce4cb0d]
[ 3] <snip>/openmpi/intel/2015.6.233/2.0.1/lib/libopen-pal.so.20(opal_convertor_pack+0x166)[0x2ac11ce443a6]
[ 4] <snip>/openmpi/intel/2015.6.233/2.0.1/lib/libmpi.so.20(PMPI_Pack+0x17d)[0x2ac11b8bf85d]
[ 5] <snip>/openmpi/intel/2015.6.233/2.0.1/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_ireduce+0x74e)[0x2ac12f093a3e]
[ 6] <snip>/openmpi/intel/2015.6.233/2.0.1/lib/libmpi.so.20(PMPI_Ireduce+0x98)[0x2ac11b8c1e18]
[ 7] ./reducebug[0x40270d]
[ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac11c5c3b15]
[ 9] ./reducebug[0x402509]
*** End of error message ***
System environment:
# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
# icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.6.233 Build 20151119
Open MPI: 2.0.1
Open MPI repo revision: v2.0.0-257-gee86e07
Open MPI release date: Sep 02, 2016
Open RTE: 2.0.1
Any insight here would be appreciated!
Thanks,
Andre Kessler
Software Engineer
Space Exploration Technologies