I manage a very heterogeneous cluster. Some nodes have
InfiniBand, while others have 10 Gb/s Ethernet. We recently upgraded to
CentOS 7 and built a new software stack for it. We are using
OpenMPI 4.0.3, with Slurm 19.05.5 as our job scheduler.
We just noticed that when jobs are sent to the nodes with IB, they
segfault immediately, with the segfault appearing to come from
libibverbs.so. This is what I see in the stderr output for one of these
failed jobs:
srun: error: greene021: tasks 0-3: Segmentation fault
And here is what I see in the log messages of the compute node where
that segfault happened:
Jul 23 15:19:41 greene021 kernel: mpihello[7911]: segfault at 7f0635f38910 ip 7f0635f49405 sp 7ffe354485a0 error 4 in libibverbs.so.1.5.22.4[7f0635f3a000+18000]
Jul 23 15:19:41 greene021 kernel: mpihello[7912]: segfault at 7f23d51ea910 ip 7f23d51fb405 sp 7ffef250a9a0 error 4 in libibverbs.so.1.5.22.4[7f23d51ec000+18000]
Jul 23 15:19:41 greene021 kernel: mpihello[7909]: segfault at 7ff504ba5910 ip 7ff504bb6405 sp 7917ccb0 error 4 in libibverbs.so.1.5.22.4[7ff504ba7000+18000]
Jul 23 15:19:41 greene021 kernel: mpihello[7910]: segfault at 7fa58abc5910 ip 7fa58abd6405 sp 7ffdde50c0d0 error 4 in libibverbs.so.1.5.22.4[7fa58abc7000+18000]
Any idea what is going on here, or how to debug further? I've been using
OpenMPI for years, and it usually just works.
I normally start my job with srun like this:
srun ./mpihello
But even if I try to take IB out of the equation by starting the job
like this:
mpirun -mca btl ^openib ./mpihello
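(Since this build was configured --with-ucx, I'm not even sure that
excluding the openib BTL actually takes IB out of the picture; my
understanding is the UCX PML can still drive the verbs hardware. I
assume, though I haven't confirmed these are the right MCA settings,
that the more thorough way to force TCP would be something like:)

```shell
# Disable the UCX PML and restrict the BTLs to TCP plus the
# loopback and shared-memory components. These MCA parameter names
# are my assumption for forcing an all-TCP run on Open MPI 4.0.x.
mpirun --mca pml ob1 --mca btl tcp,self,vader ./mpihello
```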
I still get a segfault, although the message to stderr is now a
little different:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 8502 on node greene021
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The segfault happens immediately; it seems to happen as soon as
MPI_Init() is called. The program I'm running is a very simple MPI
"Hello world!" program.
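(For reference, the program is essentially the textbook example,
reconstructed here from memory, so the exact source may differ
slightly. It needs an MPI installation to build, e.g. with mpicc.)

```c
/* Minimal MPI "Hello world" -- essentially what mpihello does.
 * On the IB nodes it crashes before any of this code's output
 * appears, apparently inside MPI_Init() itself. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* segfault happens here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```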
The output of ompi_info is below my signature, in case that helps.
Prentice
$ ompi_info
Package: Open MPI u...@host.example.com Distribution
Open MPI: 4.0.3
Open MPI repo revision: v4.0.3
Open MPI release date: Mar 03, 2020
Open RTE: 4.0.3
Open RTE repo revision: v4.0.3
Open RTE release date: Mar 03, 2020
OPAL: 4.0.3
OPAL repo revision: v4.0.3
OPAL release date: Mar 03, 2020
MPI API: 3.1.0
Ident string: 4.0.3
Prefix: /usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3
Configured architecture: x86_64-unknown-linux-gnu
Configure host: dawson027.pppl.gov
Configured by: lglant
Configured on: Mon Jun 1 12:37:07 EDT 2020
Configure command line: '--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3'
'--with-ucx' '--with-verbs' '--with-libfabric'
'--with-libevent=/usr'
'--with-libevent-libdir=/usr/lib64'
'--with-pmix=/usr/pppl/pmix/3.1.5' '--with-pmi'
Built by: lglant
Built on: Mon Jun 1 13:05:40 EDT 2020
Built host: dawson027.pppl.gov
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
limitations in the gfortran compiler and/or Open
MPI, does not support the following: array
subsections, direct passthru (where possible) to
underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /usr/pppl/gcc/9.3.0/bin/gcc
C compiler family name: GNU
C compiler version: 9.3.0
C++ compiler: g++
C++ compiler absolute: /usr/pppl/gcc/9.3.0/bin/g++
Fort compiler: gfortran
Fort compiler abs: /usr/pppl/gcc/9.3.0/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)