Dear OpenMPI Mailing List,

I have a problem with MPI I/O when running on more than 1 rank with
very large filetypes. To reproduce the problem, please use the attached
program "mpi_io_test.c". After compilation it should be run with 2
ranks.

The program will do the following for a variety of different parameters:
1. Create an elementary datatype (commonly referred to as etype in the
MPI Standard) whose size in bytes is given by the parameter bsize.
This datatype is called blk_filetype.
2. Create a complex filetype, which is different for each rank. This
filetype divides the file into a number of blocks given by parameter
nr_blocks, each of size bsize. Each rank only gets access to a subarray
containing
nr_blocks_per_rank = nr_blocks / size
blocks (where size is the number of participating ranks). The subarray
of each rank starts at block index
rank * nr_blocks_per_rank.
This guarantees that the regions of the different ranks don't overlap.
The resulting datatype is called full_filetype.
3. Allocate enough memory on each rank to hold one whole block.
4. Fill the allocated memory with the rank number so that the resulting
file can be checked for correctness.
5. Open a file named fname and set the view using the previously
generated blk_filetype and full_filetype.
6. Write one block on each rank, using the collective routine
MPI_File_write_all.
7. Clean up.

The above will be repeated for different values of bsize and nr_blocks.
Please note that there is no overflow of the basic datatype int used
for the parameters. The output is verified using
hd fname
(i.e. hexdump -C), which produces a hex dump of the file. This tool
collapses runs of identical lines into a single output line marked with
'*'. The resulting output of such a call looks like the following:
00000000  01 01 01 01 01 01 01 01  01 01 01 01 01 01 01 01  |................|
*
1f400000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
3e800000  02 02 02 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|
*
5dc00000
This example is to be read in the following manner:
- From byte 00000000 to 1f400000 (which is equal to 500 MiB) the file
contains the value 01 in each byte.
- From byte 1f400000 to 3e800000 (which is equal to 1000 MiB) the file
contains the value 00 in each byte.
- From byte 3e800000 to 5dc00000 (which is equal to 1500 MiB) the file
contains the value 02 in each byte.
- The file ends here.
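(For reference: 0x1f400000 = 524288000 = 500 * 1024 * 1024 bytes, i.e.
exactly one 500 MiB block, so the offsets line up with the block
boundaries.)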
This is the correct output of the above outlined program with parameters
bsize=500*1024*1024
nr_blocks=4
running on 2 ranks. The attached file contains a number of tests for
different cases. These were made to pinpoint the source of the problem
and to rule out other potentially relevant factors.
I consider an output wrong if it does not follow from the parameters or
if the program crashes during execution.
The only difference between OpenMPI and Intel MPI, according to my
tests, is the behavior on error: OpenMPI mostly writes wrong data but
does not crash, whereas Intel MPI mostly crashes.

The tests and their results are documented in comments in the source.
The final conclusions I draw from the tests are the following:

1. If the filetype used in the view describes an amount of bytes equal
to or exceeding 2^32 bytes = 4 GiB, the code produces wrong output. For
values slightly smaller (the second example with fname="test_8_blocks"
uses a total filetype size of 4000 MiB, which is smaller than 4 GiB)
the code works as expected (see also the overview after this list).
2. Actually writing to the described regions is not important. When the
filetype describes an area >= 4 GiB but the program only writes to
regions much smaller than that, the code still produces undefined
behavior (see the 6th example with fname="test_too_large_blocks").
3. It doesn't matter whether the block size or the number of blocks
pushes the filetype over the 4 GiB limit (refer to the 5th and 6th
example, with filenames "test_16_blocks" and "test_too_large_blocks",
respectively).
4. If the binary is launched using only one rank, the output is always
as expected (refer to the 3rd and 4th example, with filenames
"test_too_large_blocks_single" and
"test_too_large_blocks_single_even_larger", respectively).

There are, of course, many other things one could test.
It seems that the implementations use 32-bit integer variables to
compute the byte offsets inside the filetype. Since the filetype is
defined using two 32-bit integer parameters, this can easily lead to
integer overflows if the user supplies large values. It seems that no
implementation expects this problem, and therefore none of them handles
it gracefully.
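
To make the suspected failure mode concrete, here is a small
stand-alone C program (my own illustration, not code taken from any MPI
implementation) that shows how the extent of the filetype from the 6th
test wraps around when it is computed in 32-bit arithmetic:

#include <stdio.h>

int main(void)
{
  int bsize     = 1024 * 1024 * 1024;  /* 1 GiB block, as in the 6th test */
  int nr_blocks = 4;                   /* filetype extent should be 4 GiB */

  /* Correct extent, computed in 64 bit: 4294967296 bytes. */
  long long extent64 = (long long)bsize * nr_blocks;

  /* The same value forced back into a 32-bit int, as an implementation
     doing its bookkeeping in int would end up with; on common platforms
     this wraps around to 0. */
  int extent32 = (int)extent64;

  printf("64-bit extent: %lld bytes\n", extent64);
  printf("32-bit extent: %d bytes\n", extent32);
  return 0;
}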

I looked at ILP64 support <https://software.intel.com/en-us/node/528914>,
but it only adapts the function parameters, not the internally used
variables, and it is also not available for C.

I looked at integer overflow trapping
<https://www.gnu.org/software/libc/manual/html_node/Program-Error-Signals.html#Program%20Error%20Signals>
(FPE_INTOVF_TRAP), which could help to verify the source of the
problem, but it doesn't seem to be possible for C. Intel does not
<https://software.intel.com/en-us/forums/intel-c-compiler/topic/306156>
offer any built-in integer overflow trapping either.

There are ways to circumvent this problem in most cases. It is only
unavoidable if the logic of a program requires complex, non-repeating
data structures with sizes of 4 GiB or more. Even then, one could split
up the filetype and use a different displacement in two distinct write
calls, as sketched below.
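
As a rough sketch of what I mean (my own untested code; the helper
write_split and its assumptions, namely that nr_blocks is even and
nr_blocks / 2 is divisible by the number of ranks, are hypothetical),
the write routine from the attached program could be split into two
views of half the extent each:

#include <stdlib.h>
#include <string.h>
#include <mpi.h>

static void
write_split(int bsize, int nr_blocks, const char *fname, MPI_Comm comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  /* etype: one contiguous block of bsize bytes, as in the original code. */
  MPI_Datatype blk_filetype, half_filetype;
  MPI_Type_contiguous(bsize, MPI_BYTE, &blk_filetype);
  MPI_Type_commit(&blk_filetype);

  /* Filetype covering only half of the blocks, so that its extent stays
     below 4 GiB for the problematic parameter combinations. */
  int half_blocks     = nr_blocks / 2;
  int blocks_per_rank = half_blocks / size;
  int sizes[1]    = {half_blocks};
  int subsizes[1] = {blocks_per_rank};
  int starts[1]   = {rank * blocks_per_rank};
  MPI_Type_create_subarray(1, sizes, subsizes, starts, MPI_ORDER_C,
                           blk_filetype, &half_filetype);
  MPI_Type_commit(&half_filetype);

  char *blk_mem = malloc(bsize);
  memset(blk_mem, rank + 1, bsize);

  /* Byte displacement of the upper half, computed in 64 bit. */
  MPI_Offset half_disp = (MPI_Offset)bsize * half_blocks;

  MPI_File file;
  MPI_File_open(comm, fname, MPI_MODE_WRONLY | MPI_MODE_CREATE,
                MPI_INFO_NULL, &file);

  /* First call: the view starts at displacement 0 and covers the lower
     half of the file. */
  MPI_File_set_view(file, 0, blk_filetype, half_filetype, "native",
                    MPI_INFO_NULL);
  MPI_File_write_all(file, blk_mem, 1, blk_filetype, MPI_STATUS_IGNORE);

  /* Second call: same small filetype, shifted by an explicit byte
     displacement so that it covers the upper half. */
  MPI_File_set_view(file, half_disp, blk_filetype, half_filetype, "native",
                    MPI_INFO_NULL);
  MPI_File_write_all(file, blk_mem, 1, blk_filetype, MPI_STATUS_IGNORE);

  MPI_File_close(&file);

  free(blk_mem);
  MPI_Type_free(&blk_filetype);
  MPI_Type_free(&half_filetype);
}

Each view then describes at most half of the original extent, at the
cost of one extra collective call, but the resulting file layout is of
course not identical to the one produced by the attached tests.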

Still, this problem violates the standard, as it produces undefined
behavior even though the API is used in a conforming way. The
implementations should at least warn the user, but ideally they should
use larger datatypes for the filetype computations. A user who stumbles
on this problem will have a hard time debugging it.

Thank you very much for reading everything ;)

Kind Regards,

Nils
#include <stdlib.h> //malloc, free
#include <mpi.h>
#include <string.h> // memset
#include <stdio.h>
/*
hd test_ok
gives
00000000  01 01 01 01 01 01 01 01  01 01 01 01 01 01 01 01  |................|
*
1f400000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
3e800000  02 02 02 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|
*
5dc00000

hd test_not_ok
gives something completely different or crashes
*/

static void
write(int bsize, int nr_blocks, const char *fname, MPI_Comm comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  MPI_Datatype blk_filetype;
  MPI_Datatype full_filetype;

  MPI_Type_contiguous(bsize, MPI_BYTE, &blk_filetype);
  MPI_Type_commit(&blk_filetype);

  int nr_blocks_per_rank = nr_blocks / size;
  int sizes[1] = {nr_blocks};
  int subsizes[1] = {nr_blocks_per_rank};
  int starts[1] = {rank * nr_blocks_per_rank};

  MPI_Type_create_subarray(1, sizes, subsizes, starts, MPI_ORDER_C,
    blk_filetype, &full_filetype);
  MPI_Type_commit(&full_filetype);

  char *blk_mem = malloc(bsize);  /* buffer for one block of bsize bytes */
  memset(blk_mem, rank + 1, bsize);

  MPI_File file;
  MPI_File_open(comm, fname, MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &file);
  MPI_File_set_view(file, 0, blk_filetype, full_filetype, "native", MPI_INFO_NULL);
  MPI_File_write_all(file, blk_mem, 1, blk_filetype, MPI_STATUS_IGNORE);
  MPI_File_close(&file);

  free(blk_mem);
  MPI_Type_free(&blk_filetype);
  MPI_Type_free(&full_filetype);
}

#define printf0(comm, ...) do { int rank_; MPI_Comm_rank(comm, &rank_); if (rank_ == 0) { printf(__VA_ARGS__); } } while (0)

int
main (int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  MPI_Comm comm = MPI_COMM_WORLD;
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  // 1. WORKS
  // 500 MiB block size, total = 2000 MiB
  // written data: 0 - 500 MiB && 1000 MiB - 1500 MiB
  printf0(comm, "TEST 1 Start\n");
  write(500 * 1024 * 1024, 4, "test_ok", comm);
  printf0(comm, "TEST 1 End\n");

  // 2. WORKS
  // 500 MiB block size, total = 4000 MiB
  // written data: 0 - 500 MiB && 2000 MiB - 2500 MiB
  // the written region extends up to 2500 MiB = 2.44140625 GiB > 2**31 bytes
  printf0(comm, "TEST 2 Start\n");
  write(500 * 1024 * 1024, 8, "test_8_blocks", comm);
  printf0(comm, "TEST 2 End\n");

  if (rank == 0) {
    //both WORK
    // 3.
    printf0(comm, "TEST 3 Start\n");
    write(1 * 1024 * 1024 * 1024, 4, "test_too_large_blocks_single", MPI_COMM_SELF);
    printf0(comm, "TEST 3 End\n");
    // 4.
    printf0(comm, "TEST 4 Start\n");
    write(1 * 1024 * 1024 * 1024, 8, "test_too_large_blocks_single_even_larger", MPI_COMM_SELF);
    printf0(comm, "TEST 4 End\n");
  }

  // 5. BROKEN
  // 500 MiB block size, total = 8000 MiB
  // written data: 0 - 500 MiB && 4000 MiB - 4500 MiB
  printf0(comm, "TEST 5 Start\n");
  write(500 * 1024 * 1024, 16, "test_16_blocks", comm);
  printf0(comm, "TEST 5 End\n");

  // 6. BROKEN
  // 1 GiB block size, total = 4 GiB
  // written data: 0 - 1 GiB && 2 GiB - 3 GiB
  // the written region extends past 2**31 bytes = 2 GiB
  printf0(comm, "TEST 6 Start\n");
  write(1 * 1024 * 1024 * 1024, 4, "test_too_large_blocks", comm);
  printf0(comm, "TEST 6 End\n");

  MPI_Finalize();
  return 0;
}