Hi, I'm trying to build an OpenMPI 5.0.3 environment on the Cray EX HPC with Slingshot 10 support.

Generally speaking, there were no error messages while building OpenMPI, and make check did not report any failures either. However, when I tested the OpenMPI environment with a simple 'hello world' MPI Fortran code, it printed the error messages below and caught signal 11 in libucs when '-mca btl ofi' was specified.

    No components were able to be opened in the btl framework.

    This typically means that either no components of this type were
    installed, or none of the installed components can be loaded.
    Sometimes this means that shared libraries required by these
    components are unable to be found/loaded

      Host: x3001c027b4n0
      Framework: btl
    -----------------------------------------------------------------------------------------------------
    Caught signal 11 ( Segmentation fault: address not mapped to object at address (nil))
    /project/app/ucx/1.12.1/lib/libucs.so.0 (ucs_handle_error+0x134)

This left me unsure whether OpenMPI was actually built with full Slingshot 10 support, and whether it runs over Slingshot 10 properly.

Here is the build environment on the Cray EX HPC with SLES 15 SP3:

    OpenMPI 5.0.3 + Intel 2022.0.2 + UCX 1.12.1 + libfabric 1.11.0.4.125-SSHOT2.0.0 + mlnx-ofed 5.5.1

Here are my configure options:

    --enable-mpi-fortran \
    --enable-shared \
    --with-pic \
    --with-ofi=/opt/cray/libfabric/1.11.0.4.125 \
    --with-ofi-libdir=/opt/cray/libfabric/1.11.0.4.125/lib64 \
    --with-ucx=/project/app/ucx/1.12.1 \
    --with-pmix=internal \
    --with-slingshot \
    --with-pbs \
    --with-tm=/opt/pbs \
    --with-singularity=/project/app/singularity/3.10.3 \
    --with-lustre=/usr \
    CC=icc \
    FC=ifort \
    CXX=icpc

Here is the output of lspci on the compute nodes:

    03:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
    24:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

Here is what confuses me:

1. After configure completed, the PMIx summary did not show Slingshot support as enabled under transports.
2. config.log did not show any checks for Slingshot during the MCA checks. It only showed that --with-slingshot was passed as an argument.
3. Looking further into the configure scripts, the only one that checks for Slingshot support is 3rd-party/openpmix/src/mca/pnet/sshot/configure.m4, but it looks like it is never called, since config.log shows no checks for the relevant dependencies such as CXI and JANSSON. I also believe the CXI library is not installed on the machine.

Here are my questions:

1. How can I tell, from ompi_info, ucx_info, or some other output, whether OpenMPI was built with full Slingshot 10 support?
2. Is the CXI library only an optional package for OpenMPI to get Slingshot 10 support?
3. Which mpirun arguments (cma, pmi, etc.) can be used to make sure an MPI application runs over Slingshot 10 properly?
4. Which OpenMPI parameters can be used to double-check runtime information over Slingshot 10?
5. Which OpenMPI parameters can be used to tune performance over Slingshot 10?

I have also attached the output of 'ompi_info -a' and 'ucx_info -d' for your reference.

I appreciate your time and comments.

Regards,
Jerry
<<attachment: ompi_ucx_info.zip>>