Hi,

I'm trying to build an OpenMPI 5.0.3 environment on a Cray EX HPC system with
Slingshot 10 support.

Generally speaking, there were no error messages while building OpenMPI, and
make check did not report any failures either.

When I tested the OpenMPI environment with a simple 'hello world' MPI Fortran
program, it printed the error messages below and caught signal 11 in libucs
when '-mca btl ofi' was specified.

No components were able to be opened in the btl framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded

Host: x3001c027b4n0
Framework: btl
-----------------------------------------------------------------------------------------------------
Caught signal 11 ( Segmentation fault: address not mapped to object at address 
(nil))

/project/app/ucx/1.12.1/lib/libucs.so.0 (ucs_handle_error+0x134)


This confuses me: I am not sure whether OpenMPI was built successfully with
full Slingshot 10 support, or whether it runs over Slingshot 10 properly.


Here is the build environment on the Cray EX HPC with SLES 15 SP3:

    OpenMPI 5.0.3 + Intel 2022.0.2 + UCX 1.12.1 + libfabric 1.11.0.4.125-SSHOT2.0.0 + mlnx-ofed 5.5.1

Here are my configure options:

  --enable-mpi-fortran \
  --enable-shared \
  --with-pic \
  --with-ofi=/opt/cray/libfabric/1.11.0.4.125 \
  --with-ofi-libdir=/opt/cray/libfabric/1.11.0.4.125/lib64 \
  --with-ucx=/project/app/ucx/1.12.1 \
  --with-pmix=internal \
  --with-slingshot \
  --with-pbs \
  --with-tm=/opt/pbs \
  --with-singularity=/project/app/singularity/3.10.3 \
  --with-lustre=/usr \
  CC=icc \
  FC=ifort \
  CXX=icpc

Here is the lspci output on the compute nodes:

    03:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
    24:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

Here is what confuses me:

  1. After configuration completed, the pmix summary did not show Slingshot
     support as enabled for the transports.
  2. config.log did not show any checks for Slingshot during the mca checks;
     it only showed that --with-slingshot was passed as an argument.
  3. Looking further into the configure scripts, the only one that checks for
     Slingshot support is
     3rd-party/openpmix/src/mca/pnet/sshot/configure.m4, but it appears never
     to be called: config.log shows no checks for its dependencies, such as
     CXI and JANSSON. I also believe the CXI library is not installed on the
     machine.


Here are my questions:

  1. How can I tell from ompi_info, ucx_info, or other output whether OpenMPI
     was built successfully with full Slingshot 10 support?
  2. Is the CXI library only an optional package for OpenMPI to get Slingshot
     10 support?
  3. Which mpirun arguments (cma, pmi, etc.) can be used to make sure an MPI
     application runs over Slingshot 10 properly?
  4. Which OpenMPI parameters can be used to double-check runtime information
     over Slingshot 10?
  5. Which OpenMPI parameters can be used to tune performance over Slingshot
     10?


I have also attached the output of 'ompi_info -a' and 'ucx_info -d' for your
reference.

I appreciate your time and comments.

Regards

Jerry


<<attachment: ompi_ucx_info.zip>>
