I get the following error whenever my problem definition gets “too big” and I
run more than one MPI process on a specific node:

[1576877928.118767] [tebow148:18579:0]          ib_md.c:478  UCX  ERROR 
ibv_reg_mr(address=0x2b77bc51b640, length=131072, access=0xf) failed: Cannot 
allocate memory
[1576877928.118799] [tebow148:18579:0]         ucp_mm.c:111  UCX  ERROR failed 
to register address 0x2b77bc51b640 length 131072 on md[3]=ib/mlx4_0: 
Input/output error
[1576877928.118803] [tebow148:18579:0]    ucp_request.c:264  UCX  ERROR failed 
to register user buffer datatype 0x20 address 0x2b77bc51b640 len 131072: 
Input/output error
[1576877928.118749] [tebow148:18580:0]          ib_md.c:478  UCX  ERROR 
ibv_reg_mr(address=0x2b12d84fb640, length=131072, access=0xf) failed: Cannot 
allocate memory
[1576877928.118790] [tebow148:18580:0]         ucp_mm.c:111  UCX  ERROR failed 
to register address 0x2b12d84fb640 length 131072 on md[3]=ib/mlx4_0: 
Input/output error
[1576877928.118795] [tebow148:18580:0]    ucp_request.c:264  UCX  ERROR failed 
to register user buffer datatype 0x20 address 0x2b12d84fb640 len 131072: 
Input/output error
[tebow148:18579:0:18579]        rndv.c:364  Assertion `status == UCS_OK' failed

If I run an even larger problem definition with a single MPI process on a given
node I still get the error (though the smaller problem then executes with only
1 MPI task). It appears I’m bumping up against some memory limit that I’m
trying to track down so I can free it up. The problem takes 45 minutes to reach
the failure point, so it’s a slow debug process.
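
One check that doesn’t require waiting out the full 45 minutes: the effective
limits of the live ranks can be read directly on tebow148 while the job runs
(the PID here is taken from the UCX output above):

$ grep -i 'locked memory' /proc/18579/limits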

My ulimit settings are:

galloway@tebow:$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1033147
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
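
Those values are from an interactive shell on tebow, though; the ranks on
tebow148 may inherit different limits under the launcher, so something like
this should show what they actually see:

$ mpirun -np 1 --host tebow148 bash -c 'ulimit -l; ulimit -n'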

I thought perhaps my log_num_mtt setting was the culprit, and updated it to the
following (log_num_mtt was previously 0; log_mtts_per_seg has always stayed at
3), but I get the same error with both settings:

$ cat /sys/module/mlx4_core/parameters/log_num_mtt
26
$ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
3
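
If I have the usual mlx4 formula right (max_reg_mem = 2^log_num_mtt *
2^log_mtts_per_seg * PAGE_SIZE), those values allow roughly 2 TiB of registered
memory with 4 KiB pages, so the MTT table no longer looks like the limiting
factor:

$ echo $(( (1 << (26 + 3 + 12)) / (1 << 30) ))   # registerable memory in GiB
2048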

The failing run was launched with the following variables exported:
export OMPI_MCA_btl=^vader,tcp,openib
export UCX_TLS=rc,self,sm
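
For reference, what UCX actually detects on a node can be dumped with ucx_info
(the path is assumed from the install prefix in my configure line below):

$ /opt/ucx/ucx-1.6.1/install/bin/ucx_info -d | grep -iE 'transport|device'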

My openmpi installation was configured with:
$ ./configure --with-verbs=/usr/ --with-ucx=/opt/ucx/ucx-1.6.1/install/ 
--enable-mca-no-build=btl-uct --prefix=/opt/openmpi/blds/mpi-4.0.2-intel-2019 
--enable-orterun-prefix-by-default CC=icc CXX=icpc F77=ifort FC=ifort
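
ompi_info under that prefix can confirm the UCX components were actually
built, e.g.:

$ /opt/openmpi/blds/mpi-4.0.2-intel-2019/bin/ompi_info | grep -i ucx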

The job that fails runs 18 MPI processes with 28 threads per process (1 per
node on 14 nodes, plus 3 MPI tasks on node148 along with the master process,
also on 148).
The master process takes up 52 GB. When I launch only one MPI task per node
there are no errors for the smaller problem, but when I launch two or more MPI
tasks per node the smaller problem fails.
Additionally, if I make the problem bigger, 72 GB on the master process, it
then fails with just one MPI process on a given node.

So far it has only failed on one particular node, but that node has also hosted
the first worker process listed in the machinefile (i.e. the second entry, or
“slots=2” on that node).
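
Schematically, the machinefile for the failing smaller run looks like this
(only tebow148 is a real hostname; the others are placeholders):

tebow148 slots=2
node-xx slots=1
node-yy slots=1
...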

I will run a test case to see whether the failure always lands on that
particular node, or whether it simply hits the first worker listed (which is
what I suspect). This node is our newest: 96 cores, 768 GB RAM, a Dell
PowerEdge R840 purchased ~3 months ago.

Thanks,
Jack
