Hello,
I'm running an MPI program which uses passive-target RMA to access shared arrays.
On some systems this program does not work as expected.
When running on several nodes it still produces the correct results, but only the
process with rank 0 (the one holding the shared arrays in its local memory)
actually works on the shared arrays, which is not the intended behavior.
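For context, the pattern in question is a shared work index on rank 0 that every
process atomically fetches and increments under a passive-target lock. Below is a
stripped-down sketch of that pattern (my reduction for this email, not the attached
program itself; the real code uses separate windows for the index and for the
input/output buffers):

#include <mpi.h>
#include <iostream>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Rank 0 exposes the shared counter; the other ranks expose an empty window.
    long counter = 0;
    MPI_Win win;
    MPI_Win_create(rank == 0 ? &counter : nullptr,
                   rank == 0 ? (MPI_Aint) sizeof(long) : 0,
                   sizeof(long), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    const long total_work = 100;
    const long one = 1;
    long my_pos = 0;

    while (true) {
        // Passive-target epoch: atomically fetch-and-increment the index on rank 0.
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Fetch_and_op(&one, &my_pos, MPI_LONG, 0, 0, MPI_SUM, win);
        MPI_Win_unlock(0, win);
        if (my_pos >= total_work) break;
        std::cout << "Rank: " << rank << " ||| Position: " << my_pos << std::endl;
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

With correct passive-target synchronization the fetched positions should end up
spread across all ranks, which is exactly what stops happening in our program under
OpenMPI 4 with UCX.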
This has happened with OpenMPI 4, in particular with OpenMPI 4.0.5 and
OpenMPI 4.1.4.
However, when compiling and running with OpenMPI 3 (in particular OpenMPI 3.1.4),
the program works as expected and all processes work on the shared structures.
In addition, when OpenMPI 4 is compiled to use verbs instead of UCX, the program
also works as expected.
We have therefore concluded that there may be a problem with the use of UCX in
OpenMPI.
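In case it helps with reproducing this, the UCX components can also be forced or
excluded at runtime instead of recompiling (the component names below are my
assumption for a typical UCX/verbs build, and ./rma_test stands for the attached
test program):

ompi_info | grep -i ucx                           # is UCX support built in?
mpirun --mca pml ucx --mca osc ucx ./rma_test     # force the UCX PML and one-sided components
mpirun --mca pml ^ucx --mca osc ^ucx ./rma_test   # exclude UCX so the verbs/btl path is used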
About the system I am working on:
- Nodes on the system are connected through an InfiniBand FDR network.
- I'm compiling with g++ (GCC) 8.3.0 and using the different OpenMPI versions
stated above.
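The UCX release installed on the system can be checked with ucx_info -v (assuming
the UCX command-line tools are available); I can include that output as well if it
helps.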
I attach sample code to help reproduce the undesired behavior.
I also include the output of the test program (1) when it shows the undesired
behavior and (2) when it behaves as expected.
Can someone help me understand whether the problem is in the program or in
OpenMPI and UCX?
Thanks a lot!
(1) Output with the undesired behavior:
+--+
Rank: 0 ||| Position: 0
Rank: 0 ||| Position: 1
Rank: 0 ||| Position: 2
...
Rank: 0 ||| Position: 85
Rank: 0 ||| Position: 86
Rank: 0 ||| Position: 87
...
Rank: 0 ||| Position: 1997
Rank: 0 ||| Position: 1998
Rank: 0 ||| Position: 1999
*
*
*
*
* Small correctness check *
Position 0
||| Input value: 0
||| Output value:0.00
||| Expected output: 0.00
...
Position 1999
||| Input value: 1999
||| Output value:4997.50
||| Expected output: 4997.50
*
*
*
*
* Accesses per process data *
Process 0 accesses: 2000
Process 1 accesses: 0
Process 2 accesses: 0
Process 3 accesses: 0
Process 4 accesses: 0
Process 5 accesses: 0
Process 6 accesses: 0
Process 7 accesses: 0
+--+
(2) Output with the expected behavior:
+--+
Rank: 0 ||| Position: 7
Rank: 0 ||| Position: 8
Rank: 0 ||| Position: 9
...
Rank: 3 ||| Position: 24
Rank: 4 ||| Position: 28
Rank: 7 ||| Position: 19
...
Rank: 3 ||| Position: 1976
Rank: 2 ||| Position: 1985
Rank: 6 ||| Position: 1994
*
*
*
*
* Small correctness check *
Position 0
||| Input value: 0
||| Output value:0.00
||| Expected output: 0.00
...
Position 1999
||| Input value: 1999
||| Output value:4997.50
||| Expected output: 4997.50
*
*
*
*
* Accesses per process data *
Process 0 accesses: 425
Process 1 accesses: 226
Process 2 accesses: 222
Process 3 accesses: 226
Process 4 accesses: 228
Process 5 accesses: 227
Process 6 accesses: 222
Process 7 accesses: 224
+--+
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using std::cout;
using std::endl;
// MPI added
#include <mpi.h>
#define MPI_RANK_0 0
#define MULT_FACTOR 2.5
static void
process_data(int *input_buffer,
             double *output_buffer,
             size_t BLOCK_SIZE) {
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        output_buffer[i] = (double) input_buffer[i] * MULT_FACTOR;
    }
}
int
main(int argc, char **argv) {
    int rank, number_of_processes;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t VECTOR_SIZE = 2000;
    const size_t MY_SIZE = rank ? 0 : VECTOR_SIZE;

    int *main_input_buffer;
    double *main_output_buffer;

    // Rank 0 has the input data
    if (rank == MPI_RANK_0) {
        MPI_Alloc_mem(VECTOR_SIZE * sizeof(int), MPI_INFO_NULL, &main_input_buffer);
        MPI_Alloc_mem(VECTOR_SIZE * sizeof(double), MPI_INFO_NULL, &main_output_buffer);
        for (size_t i = 0; i < VECTOR_SIZE; i++) {
            main_input_buffer[i] = (int) i;
        }
    }

    // We will create a shared index to access shared data on Rank 0
    // Also, we will share input and output buffers on P0
    size_t *main_buffer_index;
    MPI_Alloc_mem(1 * sizeof(size_t), MPI_INFO_NULL, &main_buffer_index);
    *main_buffer_index = 0;

    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win index_window, input_window, output_window;
    MPI_Win_create(main_buffer_index, 1 * sizeof(size_t), sizeof(size_t),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &index_window);
    MPI_Win_create(main_input_buffer,
                   MY_SIZE * sizeof(int),
                   sizeof(int),
                   MPI_INFO_NULL,