I get the following error whenever my problem definition gets "too big" and I run more than one MPI process on a given node:

[1576877928.118767] [tebow148:18579:0] ib_md.c:478 UCX ERROR ibv_reg_mr(address=0x2b77bc51b640, length=131072, access=0xf) failed: Cannot allocate memory
[1576877928.118799] [tebow148:18579:0] ucp_mm.c:111 UCX ERROR failed to register address 0x2b77bc51b640 length 131072 on md[3]=ib/mlx4_0: Input/output error
[1576877928.118803] [tebow148:18579:0] ucp_request.c:264 UCX ERROR failed to register user buffer datatype 0x20 address 0x2b77bc51b640 len 131072: Input/output error
[1576877928.118749] [tebow148:18580:0] ib_md.c:478 UCX ERROR ibv_reg_mr(address=0x2b12d84fb640, length=131072, access=0xf) failed: Cannot allocate memory
[1576877928.118790] [tebow148:18580:0] ucp_mm.c:111 UCX ERROR failed to register address 0x2b12d84fb640 length 131072 on md[3]=ib/mlx4_0: Input/output error
[1576877928.118795] [tebow148:18580:0] ucp_request.c:264 UCX ERROR failed to register user buffer datatype 0x20 address 0x2b12d84fb640 len 131072: Input/output error
[tebow148:18579:0:18579] rndv.c:364 Assertion `status == UCS_OK' failed

If I run an even larger problem definition with a single MPI process on a given node, I still get the error (whereas the smaller problem executes fine with only one MPI task per node). It appears I'm bumping into some memory limit that I'm trying to track down and free up. The problem takes 45 minutes to reach the failure point, so it's a slow debug process.

My ulimit settings are:

galloway@tebow:$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1033147
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I thought perhaps my log_num_mtt setting was the culprit, so I raised it to the value below (log_num_mtt was previously 0; log_mtts_per_seg has always stayed at 3), but in both cases I get the same error (see the first sketch at the end of this message for the registration ceiling I believe these parameters imply):

cat /sys/module/mlx4_core/parameters/log_num_mtt
26
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
3

The failed run was launched with the following variables exported:

export OMPI_MCA_btl=^vader,tcp,openib
export UCX_TLS=rc,self,sm

My Open MPI installation was configured with:

$ ./configure --with-verbs=/usr/ --with-ucx=/opt/ucx/ucx-1.6.1/install/ \
    --enable-mca-no-build=btl-uct --prefix=/opt/openmpi/blds/mpi-4.0.2-intel-2019 \
    --enable-orterun-prefix-by-default CC=icc CXX=icpc F77=ifort FC=ifort

The failing job runs 18 MPI processes with 28 threads per process: one process per node on 14 nodes, plus 3 MPI tasks on node148 alongside the master process. The master process takes up 52 GB, and when I launch only one MPI task per node there are no errors on the smaller problem; with two or more MPI tasks on a node, the smaller problem fails. Additionally, if I grow the problem so that the master process takes 72 GB, it fails even with just one MPI process on a given node.

So far it has only failed on one particular node, but that node has also been the first worker process listed in the machinefile (i.e. the second entry, or "slots=2" on that node). I will run a test case to see whether the failure sticks to that particular node or simply follows the first worker listed (which is my assumption). The node in question is our newest: 96 cores, 768 GB RAM, a Dell PowerEdge R840 purchased about three months ago.
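In the meantime, here are a few checks I've sketched while I dig. First, my understanding (an assumption on my part, based on the commonly cited Mellanox mlx4 formula; please correct me if it's wrong) is that the registerable-memory ceiling is 2^log_num_mtt * 2^log_mtts_per_seg * page_size. This snippet computes it from the live module parameters:

# Registration ceiling from the mlx4 module parameters
# (formula assumed from Mellanox mlx4 documentation)
log_num_mtt=$(cat /sys/module/mlx4_core/parameters/log_num_mtt)
log_mtts_per_seg=$(cat /sys/module/mlx4_core/parameters/log_mtts_per_seg)
page_size=$(getconf PAGE_SIZE)
# With log_num_mtt=26, log_mtts_per_seg=3 and 4 KiB pages:
# 2^26 * 2^3 * 4096 = 2 TiB
echo "$(( (1 << (log_num_mtt + log_mtts_per_seg)) * page_size / 1024**3 )) GiB registerable"

With my current values that works out to 2048 GiB, well above this node's 768 GB of RAM, so if the formula holds, the MTT parameters should no longer be the limiting factor.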
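Second, since the ulimit output above comes from my login shell, I've been verifying what the remotely launched ranks actually inherit (a sanity check, assuming mpirun will launch a plain shell; "machinefile" below is a placeholder for my actual machinefile):

# Print the locked-memory and open-files limits each rank really sees
mpirun --machinefile machinefile -np 18 bash -c \
    'echo "$(hostname): max locked = $(ulimit -l), open files = $(ulimit -n)"'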
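Third, while the real job is running I can read the limits of a worker on the failing node straight from /proc (18579 is just the PID from the error output above; it will differ per run):

ssh tebow148 'grep -i -e "locked memory" -e "open files" /proc/18579/limits'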
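Finally, to see what UCX itself reports for the ib/mlx4_0 memory domain, I've been querying ucx_info from my UCX install (I assume -d on 1.6.1 prints the device/memory-domain capabilities, including registration limits) and re-running with more verbose UCX logging:

/opt/ucx/ucx-1.6.1/install/bin/ucx_info -d | grep -i -A 6 "mlx4"
export UCX_LOG_LEVEL=debug    # more detail from UCX on the next failing run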
Thanks,
Jack