All,

I have been experimenting with large window allocations recently and have made some interesting observations that I would like to share.

The system under test:
  - Linux cluster equipped with InfiniBand,
  - Open MPI 2.1.1,
  - 128GB main memory per node,
  - 6GB /tmp filesystem per node.

My observations:
1) Running with 1 process on a single node, I can allocate, and write to, up to ~110 GB of memory through MPI_Alloc_mem, MPI_Win_allocate, and MPI_Win_allocate_shared.

2) If running with 1 process per node on 2 nodes, single large allocations succeed, but with the repeated allocate/free cycle in the attached code the application is reproducibly killed by the OOM killer at the 25GB allocation when using MPI_Win_allocate_shared. When I try to run it under Valgrind, I get an error from MPI_Win_allocate at ~50GB that I cannot make sense of:

```
MPI_Alloc_mem:  53687091200 B
[n131302:11989] *** An error occurred in MPI_Alloc_mem
[n131302:11989] *** reported by process [1567293441,1]
[n131302:11989] *** on communicator MPI_COMM_WORLD
[n131302:11989] *** MPI_ERR_NO_MEM: out of memory
[n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n131302:11989] ***    and potentially your MPI job)
```
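
For what it's worth, to at least get the error code back instead of aborting, the error handler on the communicator can be switched to MPI_ERRORS_RETURN. A minimal sketch (not part of the attached benchmark; size as used there):

```
/* Sketch: retrieve the error from MPI_Win_allocate instead of aborting. */
char   *baseptr;
MPI_Win win;
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
int err = MPI_Win_allocate(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                           &baseptr, &win);
if (err != MPI_SUCCESS) {
  char errstr[MPI_MAX_ERROR_STRING];
  int errlen;
  MPI_Error_string(err, errstr, &errlen);
  fprintf(stderr, "MPI_Win_allocate failed: %s\n", errstr);
}
```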

3) If running with 2 processes on a node, I get the following error from both MPI_Win_allocate and MPI_Win_allocate_shared:
```
--------------------------------------------------------------------------
It appears as if there is not enough space for /tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702 (the shared-memory backing
file). It is likely that your MPI job will now either abort or experience
performance degradation.

  Local host:  n131702
  Space Requested: 6710890760 B
  Space Available: 6433673216 B
```
This seems to be related to the size limit of /tmp. MPI_Alloc_mem works as expected, i.e., I can allocate ~50GB per process. I understand that I can point $TMP to a larger filesystem (such as Lustre), but then I am greeted with a warning on each allocation and performance seems to drop. Is there a way to fall back to the allocation strategy used in case 2)?
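
The only MPI-level knob I am aware of is the standard alloc_shared_noncontig info key, which allows the implementation to back each process's segment separately; I do not know whether Open MPI changes its backing strategy based on it, but for reference (baseptr, win, size, and sharedmem_comm as in the attached benchmark):

```
/* Sketch: pass the standard alloc_shared_noncontig hint; whether Open MPI
 * uses a different backing store in that case is unclear to me. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "alloc_shared_noncontig", "true");
MPI_Win_allocate_shared(size, 1, info, sharedmem_comm, &baseptr, &win);
MPI_Info_free(&info);
```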

4) It is also worth noting the time it takes to allocate the memory: while allocations are in the sub-millisecond range for both MPI_Alloc_mem and MPI_Win_allocate_shared, it takes >24s to allocate 100GB using MPI_Win_allocate, and the time increases linearly with the allocation size (perhaps the memory is touched or registered during allocation?).

Are these issues known? Is there perhaps documentation describing workarounds (especially for 3 and 4)?

I am attaching a small benchmark. Please make sure to adjust the MEM_PER_NODE macro to suit your system before you run it :) I'm happy to provide additional details if needed.

Best
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
#include <assert.h>
#include <stdio.h>
#include <mpi.h>
#include <string.h>

/* Total memory to allocate per node; adjust this to your system. */
#define MEM_PER_NODE (100*1024*1024*1024UL)
#define NUM_STEPS 10

/* How the allocated memory is touched; the default is to not touch it. */
//#define USE_MEM(__mem, __size)  memset(__mem, 0, __size)
//#define USE_MEM(__mem, __size)  parallel_memset(__mem, 0, __size)
#define USE_MEM(__mem, __size)

static int comm_rank;


/* Touch the memory with multiple threads (requires compiling with OpenMP). */
static void parallel_memset(char *baseptr, int val, size_t size)
{
#pragma omp parallel for 
  for (size_t i = 0; i < size; ++i) {
    baseptr[i] = val;
  }
}


static void test_alloc_mem(size_t size)
{
  char *baseptr;
  if (comm_rank == 0) {
    printf("MPI_Alloc_mem: %12zu B ", size);
  }
  double start = MPI_Wtime();
  MPI_Alloc_mem(size, MPI_INFO_NULL, &baseptr);
  double end   = MPI_Wtime();
  if (comm_rank == 0) {
    printf("(%fs)\n", end - start);
  }
  USE_MEM(baseptr, size);

  MPI_Free_mem(baseptr);
}

static void test_win_allocate(size_t size)
{
  char *baseptr;
  MPI_Win win;
  if (comm_rank == 0) {
    printf("MPI_Win_allocate: %12zu B ", size);
  }
  double start = MPI_Wtime();
  MPI_Win_allocate(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &baseptr, &win);
  double end   = MPI_Wtime();
  if (comm_rank == 0) {
    printf("(%fs)\n", end - start);
  }

  USE_MEM(baseptr, size);
  MPI_Win_free(&win);
}

static void test_win_allocate_shared(size_t size, MPI_Comm sharedmem_comm)
{
  char *baseptr;
  MPI_Win win;
  
  if (comm_rank == 0) {
    printf("MPI_Win_allocate_shared %12zu B ", size);
  }
  double start = MPI_Wtime();
  MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, sharedmem_comm, &baseptr, &win);
  double end   = MPI_Wtime();
  if (comm_rank == 0) {
    printf("(%fs)\n", end - start);
  }

  USE_MEM(baseptr, size);
  MPI_Win_free(&win);
}



int main(int argc, char **argv) 
{
  MPI_Init(&argc, &argv);

  int procs_per_node;
  MPI_Comm sharedmem_comm;
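  /* Group processes that share memory, i.e., the processes on one node. */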
  MPI_Comm_split_type(
    MPI_COMM_WORLD,
    MPI_COMM_TYPE_SHARED,
    1,
    MPI_INFO_NULL,
    &sharedmem_comm);
  MPI_Comm_size(sharedmem_comm, &procs_per_node);
  size_t max_alloc = (MEM_PER_NODE) / procs_per_node;
  size_t min_alloc = (max_alloc / (1<<(NUM_STEPS-1)));

  MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

  if (comm_rank == 0) {
    printf("Processes per node: %d\n", procs_per_node);
    printf("Minimum allocation per process: %zu\n", min_alloc);
    printf("Maximum allocation per process: %zu\n", max_alloc);
  }

  for (size_t size = min_alloc; size <= max_alloc; size *= 2) {
    test_alloc_mem(size);
  }


  for (size_t size = min_alloc; size <= max_alloc; size *= 2) {
    test_win_allocate(size);
  }

  for (size_t size = min_alloc; size <= max_alloc; size *= 2) {
    test_win_allocate_shared(size, sharedmem_comm);
  }

  MPI_Finalize();
  return 0;
}
