Hi,
to give further information about this problem... it does not seem to be
related to MPI or UCX at all, but rather appears to come from ParMETIS itself...
With ParMETIS installed from Spack with the "+int64" option, I have been
able to use both OpenMPI 4.1.2 and IntelMPI 2021.6 successfully!
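For reference, the working Spack build was something along the lines of
(I may be off on the exact variants and MPI providers):

  spack install parmetis +int64 ^openmpi@4.1.2
  spack install parmetis +int64 ^intel-oneapi-mpi@2021.6.0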
With ParMETIS installed by PETSc with the "--with-64-bit-indices=1" option,
none of the MPI implementations listed below work.
I've opened an issue with PETSc here:
https://gitlab.com/petsc/petsc/-/issues/1204#note_980344101
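In case someone wants to compare the two builds, the index width each
ParMETIS was compiled with can be read off the installed metis.h, e.g.
with something like the following (the path is my PETSc prefix from the
message below; I am assuming --download-metis puts metis.h under
<prefix>/include, and that the Spack install has an equivalent header):

  grep IDXTYPEWIDTH /scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/include/metis.h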
So, sorry for bothering the MPI folks here...
Thanks for all the suggestions!
Eric
On 2022-06-01 23:31, Eric Chamberland via users wrote:
Hi,
In the past, we have successfully launched large (finite element)
computations using ParMETIS as the mesh partitioner.
We first succeeded in 2012 with OpenMPI (v2.?) and again in March 2019
with OpenMPI 3.1.2.
Today, we have a bunch of nightly (small) tests running nicely, covering
all of OpenMPI (4.0.x, 4.1.x and 5.0.x), MPICH 3.3.2 and IntelMPI 2021.6.
In preparation for launching the same computation we did in 2012, and even
larger ones, we compiled with both OpenMPI 4.0.3+ucx-1.8.0 and OpenMPI
4.1.2+ucx-1.11.2 and launched computations on problems (meshes) ranging
from small to large.
For small meshes, it goes fine.
But when the 3D mesh we are using gets close to 2^31 faces and we call
ParMETIS_V3_PartMeshKway, we always get a segfault with the same
backtrace, pointing into the UCX library:
Wed Jun 1 23:04:54 2022<stdout>:chrono::InterfaceParMetis::ParMETIS_V3_PartMeshKway::debut VmSize: 1202304 VmRSS: 349456 VmPeak: 1211736 VmData: 500764 VmHWM: 359012 <etiq_18>
Wed Jun 1 23:07:07 2022<stdout>:Erreur : MEF++ Signal recu : 11 : segmentation violation
Wed Jun 1 23:07:07 2022<stdout>:Erreur :
Wed Jun 1 23:07:07 2022<stdout>:------------------------------ (Début des informations destinées aux développeurs C++) ------------------------------
Wed Jun 1 23:07:07 2022<stdout>:La pile d'appels contient 27 symboles.
Wed Jun 1 23:07:07 2022<stdout>:# 000: reqBacktrace(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) >>> probGD.opt (probGD.opt(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x71) [0x4119f1])
Wed Jun 1 23:07:07 2022<stdout>:# 001: attacheDebugger() >>> probGD.opt (probGD.opt(_Z15attacheDebuggerv+0x29a) [0x41386a])
Wed Jun 1 23:07:07 2022<stdout>:# 002: /gpfs/fs0/project/d/deteix/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x1f9f) [0x2ab3aef0e5cf]
Wed Jun 1 23:07:07 2022<stdout>:# 003: /lib64/libc.so.6(+0x36400) [0x2ab3bd59a400]
Wed Jun 1 23:07:07 2022<stdout>:# 004: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_dt_pack+0x123) [0x2ab3c966e353]
Wed Jun 1 23:07:07 2022<stdout>:# 005: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x536b7) [0x2ab3c968d6b7]
Wed Jun 1 23:07:07 2022<stdout>:# 006: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/ucx/libuct_ib.so.0(uct_dc_mlx5_ep_am_bcopy+0xd7) [0x2ab3ca712137]
Wed Jun 1 23:07:07 2022<stdout>:# 007: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x52d3c) [0x2ab3c968cd3c]
Wed Jun 1 23:07:07 2022<stdout>:# 008: /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_tag_send_nbx+0x5ad) [0x2ab3c9696dcd]
Wed Jun 1 23:07:07 2022<stdout>:# 009: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf2) [0x2ab3c922e0b2]
Wed Jun 1 23:07:07 2022<stdout>:# 010: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0x92) [0x2ab3bbca5a32]
Wed Jun 1 23:07:07 2022<stdout>:# 011: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_alltoallv_intra_pairwise+0x141) [0x2ab3bbcad941]
Wed Jun 1 23:07:07 2022<stdout>:# 012: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_alltoallv_intra_dec_fixed+0x42) [0x2ab3d4836da2]
Wed Jun 1 23:07:07 2022<stdout>:# 013: /scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(PMPI_Alltoallv+0x29) [0x2ab3bbc7bdf9]
Wed Jun 1 23:07:07 2022<stdout>:# 014: /scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(libparmetis__gkMPI_Alltoallv+0x106) [0x2ab3bb0e1c06]
Wed Jun 1 23:07:07 2022<stdout>:# 015: /scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(ParMETIS_V3_Mesh2Dual+0xdd6) [0x2ab3bb0f10b6]
Wed Jun 1 23:07:07 2022<stdout>:# 016: /scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(ParMETIS_V3_PartMeshKway+0x100) [0x2ab3bb0f1ac0]
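(If it helps the analysis: I suppose the frames above that show only an
offset could be resolved against the corresponding library with addr2line,
e.g. for frame # 005:

  addr2line -f -C -e /scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0 0x536b7

assuming the library was built with enough symbol information.)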
ParMETIS is compiled as part of PETSc 3.17.1 with 64-bit indices. Here are
the PETSc configure options:
--prefix=/scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1
COPTFLAGS=\"-O2 -march=native\"
CXXOPTFLAGS=\"-O2 -march=native\"
FOPTFLAGS=\"-O2 -march=native\"
--download-fftw=1
--download-hdf5=1
--download-hypre=1
--download-metis=1
--download-mumps=1
--download-parmetis=1
--download-plapack=1
--download-prometheus=1
--download-ptscotch=1
--download-scotch=1
--download-sprng=1
--download-superlu_dist=1
--download-triangle=1
--with-avx512-kernels=1
--with-blaslapack-dir=/scinet/intel/oneapi/2021u4/mkl/2021.4.0
--with-cc=mpicc
--with-cxx=mpicxx
--with-cxx-dialect=C++11
--with-debugging=0
--with-fc=mpifort
--with-mkl_pardiso-dir=/scinet/intel/oneapi/2021u4/mkl/2021.4.0
--with-scalapack=1
--with-scalapack-lib=\"[/scinet/intel/oneapi/2021u4/mkl/2021.4.0/lib/intel64/libmkl_scalapack_lp64.so,/scinet/intel/oneapi/2021u4/mkl/2021.4.0/lib/intel64/libmkl_blacs_openmpi_lp64.so]\"
--with-x=0
--with-64-bit-indices=1
--with-memalign=64
and the OpenMPI configure options:
'--prefix=/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2'
'--enable-mpi-cxx'
'--enable-mpi1-compatibility'
'--with-hwloc=internal'
'--with-knem=/opt/knem-1.1.3.90mlnx1'
'--with-libevent=internal'
'--with-platform=contrib/platform/mellanox/optimized'
'--with-pmix=internal'
'--with-slurm=/opt/slurm'
'--with-ucx=/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2'
I am then wondering:
1) Is the UCX library considered "stable" for production use with very
large problems?
2) Is there a way to "bypass" UCX at runtime? (a guess at what I have in
mind is sketched right after this list)
3) Any idea for debugging this?
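For question 2, what I have in mind is something along these lines at
mpirun time (I am not sure these are the right knobs, so please correct me):

  mpirun --mca pml ^ucx --mca osc ^ucx ...

or forcing the older ob1/btl path explicitly:

  mpirun --mca pml ob1 --mca btl self,vader,tcp ...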
Of course, I do not yet have a "minimal reproducer" that crashes, since the
problem only shows up on "large" problems, but I think I could export the
data for a 512-process reproducer with the ParMETIS call only...
Thanks for helping,
Eric
--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42