Our cluster (debian10/slurm16, debian11/slurm20) uses infiniband, and we
have several instances of openmpi installed through the Lmod module
system. When testing the openmpi installations with the mpi-test-suite
1.1 [1], the suite reports errors like these:
...
(Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
(Rank:0) tst_test_array[46]:Allreduce Sum
(Rank:0) tst_test_array[47]:Alltoall
Number of failed tests: 130
Summary of failed tests:
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
(4), type MPI_TYPE_MIX (27) number of values:1000
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
(4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000
...
when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number of
errors may vary, but the first errors are always about
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
When testing with openmpi/3.1.3, the tests run successfully, and there
are no failed tests.
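To compare runs, the failure summary can be tallied per test and
datatype. The following is only a minimal sketch: the function name
summarize_failures is made up for illustration, and it assumes that each
"ERROR class:..." entry sits on a single line in the raw output (the
line breaks in the excerpt above come from mail wrapping).

```shell
#!/bin/sh
# Tally failed tests per "class | test | datatype" from a saved
# mpi-test-suite log. Hypothetical helper, not part of the test suite.
# Assumption: every "ERROR class:..." entry is on one line in the log.
summarize_failures() {
    grep '^ERROR class:' "$1" |
        sed -E 's/^ERROR class:([^ ]+) test:(.*) \([0-9]+\), comm .*, type ([A-Za-z0-9_]+) \([0-9]+\).*/\1 | \2 | \3/' |
        sort | uniq -c | sort -rn
}
```

Called as "summarize_failures run.log" (run.log being whatever file the
output was redirected to), it prints one line per combination with its
count, which makes it easier to see whether the failures cluster on
particular datatypes.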
Typically, the openmpi/4.1.x installation is configured with
./configure --prefix=${PREFIX} \
--with-ucx=$UCX_HOME \
--enable-orterun-prefix-by-default \
--enable-mpi-cxx \
--with-hwloc \
--with-pmi \
--with-pmix \
--with-cuda=$CUDA_HOME \
--with-slurm
but I've also tried different build options, including with and without
--enable-mpi1-compatibility, with and without ucx, and with hwloc either
taken from the OS or compiled from source. I could not identify any
pattern in the failures.
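Besides rebuilding, the point-to-point layer could also be pinned at run
time, so that the same build is exercised once through ucx and once
through ob1. This is only a sketch under assumptions: the MPIRUN
variable and the function names are placeholders of mine, the binary
path is whatever the test suite was built as, and the btl list uses
"vader", which is the shared-memory BTL name in the 4.1.x series.

```shell
#!/bin/sh
# Launcher to use; override MPIRUN if mpirun is not on the PATH.
MPIRUN="${MPIRUN:-mpirun}"

# Run a binary with the UCX PML selected explicitly.
# $1 = number of ranks, $2 = path to the test binary (placeholder)
run_with_ucx() {
    "$MPIRUN" -np "$1" --mca pml ucx "$2"
}

# Run the same binary with the ob1 PML over shared memory and TCP,
# bypassing UCX entirely.
run_with_ob1() {
    "$MPIRUN" -np "$1" --mca pml ob1 --mca btl self,vader,tcp "$2"
}
```

If the failures show up with one PML but not the other, that would at
least point to the transport layer rather than to the test suite itself.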
Therefore, I'd like to ask you what the issue might be. Specifically,
I would like to know:
- Am I right in assuming that mpi-test-suite [1] is suitable for testing
openmpi?
- What are possible causes for these types of errors?
- What would you recommend for debugging these issues?
Kind regards,
Alois
[1] https://github.com/open-mpi/mpi-test-suite/t