Hi folks,

I've been running into some issues using the MPI_IN_PLACE option with
MPI_Ireduce, in what seems to be a regression from Open MPI v1.10.1.
Specifically, I encounter data corruption if non-root ranks supply the same
pointer for 'send' and 'recv', even though this should have no bearing on the
results, since those ranks do not expect to receive anything. Consider this
scenario:

The root rank receiving the results of the reduction ('sink', say rank 0) does:

  MPI_Ireduce(MPI_IN_PLACE, data, N, mpiType<T>(), MPI_MIN, sink, MPI_COMM_WORLD, &req);

The ranks contributing data to the reduction (say ranks 1, 2, 3) do:

  MPI_Ireduce(data, recv, N, mpiType<T>(), MPI_MIN, sink, MPI_COMM_WORLD, &req);

As I understand it from the documentation, the value of the 'recv' pointer on
the non-root ranks should be irrelevant, since the receive buffer is only
significant at the root. However, when 'recv' is set equal to the 'data'
pointer on non-root ranks, I get non-deterministic garbage values as the
result of the reduction.
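
Condensed, the failing pattern looks roughly like the sketch below. This is a
stripped-down paraphrase of the gist, not the exact code: I've hard-coded
int32_t / MPI_INT32_T in place of the templated mpiType<T>() helper, and every
rank contributes zeros, so the MPI_MIN reduction should leave all zeros at the
root:

  #include <mpi.h>
  #include <cstdint>
  #include <vector>

  int main(int argc, char** argv)
  {
      MPI_Init(&argc, &argv);
      int rank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int sink = 0;               // root rank that receives the reduction
      const int N = 10;
      std::vector<int32_t> data(N, 0);  // every rank contributes zeros
      MPI_Request req;

      if (rank == sink) {
          // Root reduces in place into 'data'.
          MPI_Ireduce(MPI_IN_PLACE, data.data(), N, MPI_INT32_T, MPI_MIN,
                      sink, MPI_COMM_WORLD, &req);
      } else {
          // Non-root ranks pass the same pointer for send and recv; recv should
          // be ignored here, but this is the case that corrupts the result.
          MPI_Ireduce(data.data(), data.data(), N, MPI_INT32_T, MPI_MIN,
                      sink, MPI_COMM_WORLD, &req);
      }
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      // 'data' on the root should still be all zeros here,
      // but with 2.0.1 it frequently is not.
      MPI_Finalize();
      return 0;
  }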

This problem goes away if you supply 'nullptr' as the recv pointer on non-root 
ranks; it also goes away if you replace MPI_Ireduce with MPI_Reduce.
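
That is, the following variant of the non-root call (again a sketch in the
terms of the example above) works reliably for me:

  // Workaround: pass nullptr as the receive buffer on non-root ranks.
  MPI_Ireduce(data.data(), nullptr, N, MPI_INT32_T, MPI_MIN,
              sink, MPI_COMM_WORLD, &req);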

Source code to reproduce is here:
https://gist.github.com/akessler/3e911102892f3d6442feeb254f74665f

mpic++ --std=c++11 -Wall -Wextra -oreducebug reducebug.cpp
mpiexec -np 3 ./reducebug

RESULT:
<snip>
Case with T = int32_t
        Expect: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
        Actual: 0, 0, -649746440, 0, -649746440, 0, 0, 0, 0, 0,
[FAILURE] : Data mismatch at sink

<another run>

Case with T = uint64_t
        Expect: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
        Actual: 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
[FAILURE] : Data mismatch at sink

You may need to run this a couple of times to get it to fail, though for me it
fails roughly 95% of the time.

Furthermore, with just two ranks I get a consistent, immediate segfault, which
suggests a memory copy is being performed between these pointers:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x90
[ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x2ac11c395100]
[ 1] ./reduce(__intel_ssse3_rep_memcpy+0x23fe)[0x41022e]
[ 2] <snip>/openmpi/intel/2015.6.233/2.0.1/lib/libopen-pal.so.20(opal_cuda_memcpy+0x7d)[0x2ac11ce4cb0d]
[ 3] <snip>/openmpi/intel/2015.6.233/2.0.1/lib/libopen-pal.so.20(opal_convertor_pack+0x166)[0x2ac11ce443a6]
[ 4] <snip>/openmpi/intel/2015.6.233/2.0.1/lib/libmpi.so.20(PMPI_Pack+0x17d)[0x2ac11b8bf85d]
[ 5] <snip>/openmpi/intel/2015.6.233/2.0.1/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_ireduce+0x74e)[0x2ac12f093a3e]
[ 6] <snip>/openmpi/intel/2015.6.233/2.0.1/lib/libmpi.so.20(PMPI_Ireduce+0x98)[0x2ac11b8c1e18]
[ 7] ./reducebug[0x40270d]
[ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac11c5c3b15]
[ 9] ./reducebug[0x402509]
*** End of error message ***

System environment:
# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

# icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, 
Version 15.0.6.233 Build 20151119

Open MPI: 2.0.1
Open MPI repo revision: v2.0.0-257-gee86e07
Open MPI release date: Sep 02, 2016
Open RTE: 2.0.1

Any insight here would be appreciated!

Thanks,

Andre Kessler
Software Engineer
Space Exploration Technologies