Re: [OMPI users] Segfault in ucp_dt_pack function from UCX library 1.8.0 and 1.11.2 for large sized communications using both OpenMPI 4.0.3 and 4.1.2

2022-06-10 Thread Eric Chamberland via users

Hi,

To give further information about this problem: it does not seem to be 
related to MPI or UCX at all, but rather seems to come from ParMETIS 
itself...


With ParMETIS installed from Spack with the "+int64" option, I have 
been able to use both OpenMPI 4.1.2 and IntelMPI 2021.6 successfully!


With ParMETIS installed by PETSc with the "--with-64-bit-indices=1" 
option, none of the MPI implementations listed below works.
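
For reference, the two build routes compare roughly like this (the exact 
Spack spec syntax and the --download-metis/--download-parmetis flags are 
sketched as assumptions here, not copied from the actual command lines):

# works: ParMETIS from Spack, with 64-bit indices
spack install parmetis +int64

# fails: ParMETIS downloaded and built by PETSc, with 64-bit PETSc indices
./configure --with-64-bit-indices=1 --download-metis --download-parmetis ...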


I have opened an issue with PETSc here: 
https://gitlab.com/petsc/petsc/-/issues/1204#note_980344101


So, sorry for disturbing the MPI folks here...

Thanks for all suggestions!

Eric

On 2022-06-01 23:31, Eric Chamberland via users wrote:


Hi,

In the past, we have successfully launched large (finite element) 
computations using ParMETIS as the mesh partitioner.


We first succeeded in 2012 with OpenMPI (v2.?) and again in March 2019 
with OpenMPI 3.1.2.


Today, we have a set of nightly (small) tests running nicely, covering 
OpenMPI (4.0.x, 4.1.x and 5.0.x), MPICH-3.3.2 and IntelMPI 2021.6.


In preparation for launching the same computation we did in 2012, and 
even larger ones, we compiled with both OpenMPI 4.0.3+ucx-1.8.0 and 
OpenMPI 4.1.2+ucx-1.11.2 and ran computations from small to large 
problems (meshes).


For small meshes, it goes fine.

But when the 3D mesh we are using reaches nearly 2^31 faces and we call 
ParMETIS_V3_PartMeshKway, we always get a segfault with the same 
backtrace, pointing into the UCX library:


Wed Jun  1 23:04:54 
2022:chrono::InterfaceParMetis::ParMETIS_V3_PartMeshKway::debut 
VmSize: 1202304 VmRSS: 349456 VmPeak: 1211736 VmData: 500764 VmHWM: 
359012 
Wed Jun  1 23:07:07 2022:Erreur    :  MEF++ Signal recu : 11 : 
 segmentation violation

Wed Jun  1 23:07:07 2022:Erreur    :
Wed Jun  1 23:07:07 2022:-- (Début 
des informations destinées aux développeurs C++) 
--

Wed Jun  1 23:07:07 2022:La pile d'appels contient 27 symboles.
Wed Jun  1 23:07:07 2022:# 000: 
reqBacktrace(std::__cxx11::basic_string, 
std::allocator >&)  >>>  probGD.opt 
(probGD.opt(_Z12reqBacktraceRNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x71) 
[0x4119f1])
Wed Jun  1 23:07:07 2022:# 001: attacheDebugger()  >>> 
 probGD.opt (probGD.opt(_Z15attacheDebuggerv+0x29a) [0x41386a])
Wed Jun  1 23:07:07 2022:# 002: 
/gpfs/fs0/project/d/deteix/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x1f9f) 
[0x2ab3aef0e5cf]
Wed Jun  1 23:07:07 2022:# 003: /lib64/libc.so.6(+0x36400) 
[0x2ab3bd59a400]
Wed Jun  1 23:07:07 2022:# 004: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_dt_pack+0x123) 
[0x2ab3c966e353]
Wed Jun  1 23:07:07 2022:# 005: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x536b7) 
[0x2ab3c968d6b7]
Wed Jun  1 23:07:07 2022:# 006: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/ucx/libuct_ib.so.0(uct_dc_mlx5_ep_am_bcopy+0xd7) 
[0x2ab3ca712137]
Wed Jun  1 23:07:07 2022:# 007: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(+0x52d3c) 
[0x2ab3c968cd3c]
Wed Jun  1 23:07:07 2022:# 008: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/ucx/1.11.2/lib/libucp.so.0(ucp_tag_send_nbx+0x5ad) 
[0x2ab3c9696dcd]
Wed Jun  1 23:07:07 2022:# 009: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0xf2) 
[0x2ab3c922e0b2]
Wed Jun  1 23:07:07 2022:# 010: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0x92) 
[0x2ab3bbca5a32]
Wed Jun  1 23:07:07 2022:# 011: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(ompi_coll_base_alltoallv_intra_pairwise+0x141) 
[0x2ab3bbcad941]
Wed Jun  1 23:07:07 2022:# 012: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_alltoallv_intra_dec_fixed+0x42) 
[0x2ab3d4836da2]
Wed Jun  1 23:07:07 2022:# 013: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0/openmpi/4.1.2+ucx-1.11.2/lib/libmpi.so.40(PMPI_Alltoallv+0x29) 
[0x2ab3bbc7bdf9]
Wed Jun  1 23:07:07 2022:# 014: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(libparmetis__gkMPI_Alltoallv+0x106) 
[0x2ab3bb0e1c06]
Wed Jun  1 23:07:07 2022:# 015: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(ParMETIS_V3_Mesh2Dual+0xdd6) 
[0x2ab3bb0f10b6]
Wed Jun  1 23:07:07 2022:# 016: 
/scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/lib/libparmetis.so(ParMETIS_V3_PartMeshKway+0x100) 
[0x2ab3bb0f1ac0]
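
As a minimal illustration of why the ~2^31 threshold above looks 
suspicious (this is an assumption about the failure mode, not something 
verified in the ParMETIS sources): if any part of the chain still uses 
32-bit indices, a global face count just above 2^31 - 1 silently wraps 
to a negative value, which can then turn into bogus buffer sizes handed 
down to MPI/UCX. A tiny standalone sketch:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* hypothetical count, slightly above 2^31 - 1, like our large mesh */
    int64_t nfaces = ((int64_t)1 << 31) + 42;
    /* what a 32-bit index type would end up storing (typically wraps) */
    int32_t nfaces32 = (int32_t)nfaces;
    printf("64-bit count: %lld, truncated 32-bit count: %d\n",
           (long long)nfaces, nfaces32);
    return 0;
}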


ParMETIS is compiled as part of PETSc 3.17.1 with 64-bit indices.  Here 
are the PETSc configure options:


--prefix=/scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1
COPTFLAGS=\"-O2 -march=native\"
CXXOPTFLAGS=\"-O2 -march=native\"
FOPTFLAGS=\"-O
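
One thing that may be worth checking (an assumption about where the 
mismatch could hide, not something verified yet) is whether the metis.h 
installed by that PETSc build really ended up with 64-bit index and real 
types, assuming the headers landed under the prefix's include/ directory:

grep -E "IDXTYPEWIDTH|REALTYPEWIDTH" /scinet/niagara/software/2022a/opt/gcc-11.2.0-openmpi-4.1.2+ucx-1.11.2/petsc-64bits/3.17.1/include/metis.h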

Re: [OMPI users] HPL: Error occurred in MPI_Recv

2022-06-10 Thread Bart Willems via users
No errors on any of the links. This is also not isolated to one or two
nodes; it happens on all cluster nodes.

Bart

On Thu, Jun 9, 2022 at 11:42 AM Collin Strassburger via users <
users@lists.open-mpi.org> wrote:

> Since it is happening on this cluster and not on others, have you checked
> the InfiniBand counters to ensure it’s not a bad cable or something along
> those lines?  I believe the command is ibdiag (or something similar).
>
>
>
> Collin
>
>
>
> From: users  On Behalf Of Bart
> Willems via users
> Sent: Thursday, June 9, 2022 12:32 PM
> To: users@lists.open-mpi.org
> Cc: Bart Willems 
> Subject: [OMPI users] HPL: Error occurred in MPI_Recv
>
>
>
> Hello,
>
>
>
> I am attempting to run High Performance Linpack (2.3) between 2 nodes with
> Open MPI 4.1.4 and MLNX_OFED_LINUX-5.6-2.0.9.0-rhel8.6-x86_64. Within a
> minute or so, the run always crashes with
>
>
>
> [node002:04556] *** An error occurred in MPI_Recv
> [node002:04556] *** reported by process [1007222785,24]
> [node002:04556] *** on communicator MPI COMMUNICATOR 5 SPLIT FROM 3
> [node002:04556] *** MPI_ERR_TRUNCATE: message truncated
> [node002:04556] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [node002:04556] ***and potentially your MPI job)
>
>
>
> I have reverted to Open MPI 4.1.2, with which I have had no issues on
> other systems, but the problem persists on this cluster.
>
>
>
> Any suggestions on steps to diagnose?
>
>
>
> Thank you,
>
> Bart
>


[OMPI users] MPI I/O, ROMIO and showing io mca parameters at run-time

2022-06-10 Thread Eric Chamberland via users

Hi,

I want to try ROMIO with OpenMPI 4.1.2 because I am observing a big 
performance difference compared with IntelMPI on GPFS.


I want to see, at *runtime*, all the parameters (names and default 
values) used by MPI (at least for the "io" framework).


I would like to get the same output that "ompi_info --all" gives me...

I have tried this:

mpiexec --mca io romio321  --mca mca_verbose 1  --mca 
mpi_show_mca_params 1 --mca io_base_verbose 1 ...


But I cannot see anything about io coming out...

With "ompi_info" I do...

Is this possible?
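
For what it's worth, the closest I can think of trying next (the values 
below are guesses based on how other Open MPI verbosity knobs behave, 
not something confirmed to print exactly what I am after) is:

# dump all MCA parameters and where they came from during MPI_Init
mpiexec --mca mpi_show_mca_params all ...

# or query the io framework without running a job
ompi_info --param io all --level 9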

Thanks,

Eric


--
Eric Chamberland, ing., M. Ing
Research Professional
GIREF/Université Laval
(418) 656-2131, ext. 41 22 42