Our cluster (debian10/slurm16, debian11/slurm20) uses InfiniBand, and we have several instances of openmpi installed through the Lmod module system. When testing the openmpi installations with the mpi-test-suite 1.1 [1], I get errors like these:

...
(Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
(Rank:0) tst_test_array[46]:Allreduce Sum
(Rank:0) tst_test_array[47]:Alltoall
Number of failed tests: 130
Summary of failed tests:
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX (27) number of values:1000
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD (4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000
...

when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number of errors may vary, but the first errors are always about
   ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD

When testing with openmpi/3.1.3, the tests run successfully and there are no failed tests.
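
For reference, the suite is built and launched roughly like this in both cases (a sketch of our setup; module names, process count, and the binary name are placeholders, and the build follows the usual autotools flow):

        # load the MPI installation under test (4.1.x or 3.1.3)
        module load openmpi/4.1.3
        # build the test suite against the loaded MPI
        ./configure CC=mpicc && make
        # run it inside a Slurm allocation across several nodes
        mpirun -np 16 ./mpi_test_suite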

Typically, the openmpi/4.1.x installation is configured with
        ./configure --prefix=${PREFIX} \
                --with-ucx=$UCX_HOME \
                --enable-orterun-prefix-by-default  \
                --enable-mpi-cxx \
                --with-hwloc \
                --with-pmi \
                --with-pmix \
                --with-cuda=$CUDA_HOME \
                --with-slurm

but I've also tried different compilation options, including with and without --enable-mpi1-compatibility, with and without UCX, and hwloc either from the OS or compiled from source, but I could not identify any pattern. One such variant is sketched below.
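
For example, one variant (without UCX, with the MPI-1 compatibility shims enabled) looked roughly like this; the commented mpirun line at the end is not something I have explored systematically, it is just the standard MCA way of forcing the ob1/vader/tcp path at run time instead of UCX:

        ./configure --prefix=${PREFIX} \
                --enable-orterun-prefix-by-default \
                --enable-mpi-cxx \
                --enable-mpi1-compatibility \
                --with-hwloc \
                --with-pmi \
                --with-pmix \
                --with-cuda=$CUDA_HOME \
                --with-slurm
        # run-time alternative to rebuilding without UCX:
        # mpirun --mca pml ob1 --mca btl self,vader,tcp -np 16 ./mpi_test_suite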

Therefore, I'd like to ask you what the issue might be. Specifically, I would like to know:

- Am I right in assuming that the mpi-test-suite [1] is suitable for testing openmpi?
- What are possible causes for this type of error?
- How would you recommend debugging these issues?

Kind regards,
  Alois


[1] https://github.com/open-mpi/mpi-test-suite
