I'm not sure I understand why you are trying to build CentOS RPMs for PMIx, Slurm, or OMPI - all three are readily available online. Is there some particular reason you are trying to do this yourself? I ask because it is non-trivial to do and requires significant familiarity with both the intricacies of RPM building and the packages involved.
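If all you need is working packages, something along these lines is usually enough. Note that the EPEL repository and the exact package names below are my assumption for a stock CentOS 7 system, not something from your mail, so check what your release actually provides (e.g., with "yum search") and adjust:

# Prebuilt packages from the distro/EPEL repositories (names may vary by release):
yum install epel-release
yum search openmpi pmix slurm
yum install openmpi openmpi-devel pmix slurm slurm-slurmd slurm-slurmctld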
On May 11, 2020, at 6:23 AM, Leandro via users <users@lists.open-mpi.org> wrote:

Hi,

I'm trying to start using Slurm, and I followed all the instructions to build PMIx and Slurm with PMIx support, but I can't get Open MPI to work. According to the PMIx documentation, I should compile Open MPI using "--with-ompi-pmix-rte", but when I tried, it fails. I need to build this as CentOS rpms. Thanks in advance for your help. I pasted some info below.

libtool: link: /tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc -std=gnu99 -std=gnu99 -DOPAL_CONFIGURE_USER=\"root\" -DOPAL_CONFIGURE_HOST=\"gr10b17n05\" "-DOPAL_CONFIGURE_DATE=\"Fri May 8 13:35:51 -03 2020\"" -DOMPI_BUILD_USER=\"root\" -DOMPI_BUILD_HOST=\"gr10b17n05\" "-DOMPI_BUILD_DATE=\"Fri May 8 13:47:32 -03 2020\"" "-DOMPI_BUILD_CFLAGS=\"-DNDEBUG -O3 -finline-functions -fno-strict-aliasing -restrict -Qoption,cpp,--extended_float_types -pthread\"" "-DOMPI_BUILD_CPPFLAGS=\"-I../../.. -I../../../orte/include \"" "-DOMPI_BUILD_CXXFLAGS=\"-DNDEBUG -O3 -finline-functions -pthread\"" "-DOMPI_BUILD_CXXCPPFLAGS=\"-I../../.. \"" "-DOMPI_BUILD_FFLAGS=\"-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -I/usr/lib64/gfortran/modules\"" -DOMPI_BUILD_FCFLAGS=\"-O3\" "-DOMPI_BUILD_LDFLAGS=\"-Wc,-static-intel -static-intel -L/usr/lib64\"" "-DOMPI_BUILD_LIBS=\"-lrt -lutil -lz -lhwloc -levent -levent_pthreads\"" -DOPAL_CC_ABSOLUTE=\"\" -DOMPI_CXX_ABSOLUTE=\"none\" -DNDEBUG -O3 -finline-functions -fno-strict-aliasing -restrict -Qoption,cpp,--extended_float_types -pthread -static-intel -static-intel -o .libs/ompi_info ompi_info.o param.o -L/usr/lib64 ../../../ompi/.libs/libmpi.so -L/usr/lib -llustreapi /root/rpmbuild/BUILD/openmpi-4.0.2/opal/.libs/libopen-pal.so ../../../opal/.libs/libopen-pal.so -lfabric -lucp -lucm -lucs -luct -lrdmacm -libverbs /usr/lib64/libpmix.so -lmunge -lrt -lutil -lz /usr/lib64/libhwloc.so -lm -ludev -lltdl -levent -levent_pthreads -pthread -Wl,-rpath -Wl,/usr/lib64
icc: warning #10237: -lcilkrts linked in dynamically, static library not available
../../../ompi/.libs/libmpi.so: undefined reference to `orte_process_info'
../../../ompi/.libs/libmpi.so: undefined reference to `orte_show_help'
make[2]: *** [ompi_info] Error 1
make[2]: Leaving directory `/root/rpmbuild/BUILD/openmpi-4.0.2/ompi/tools/ompi_info'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/openmpi-4.0.2/ompi'
make: *** [all-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.RyklCR (%build)

The orte libraries are missing. When I don't use "--with-ompi-pmix-rte" it builds, but neither mpirun nor srun works:

c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > cat machine_file
gr10b17n05
gr10b17n06
gr10b17n07
gr10b17n08
c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun -machinefile machine_file ./mpihello
[gr10b17n07:115065] [[21391,0],2] ORTE_ERROR_LOG: Not found in file base/ess_base_std_orted.c at line 362
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons. This usually is caused by:

* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
--------------------------------------------------------------------------
[gr10b17n08:142030] [[21391,0],3] ORTE_ERROR_LOG: Not found in file base/ess_base_std_orted.c at line 362
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon located on the indicated node:

  my node:     gr10b17n05
  target node: gr10b17n06

This is usually an internal programming error that should be reported to the developers. In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[gr10b17n05:171586] 1 more process has sent help message help-errmgr-base.txt / no-path
[gr10b17n05:171586] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes >

--------------------------

c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun --nolocal -np 1 --machinefile machine_file mpihello
[gr10pbs2:242828] [[60566,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun --nolocal -np 1 --machinefile machine_file mpihello
[gr10pbs2:237314] [[50968,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes >
c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > srun -N4 /bw1nfs1/Projetos1/c315/Meus_testes/mpihello
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[gr10b17n05:172693] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: gr10b17n05: task 0: Exited with exit code 1
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  getting job size failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  getting job size failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_ess_init failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_ess_init failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[gr10b17n07:116175] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[gr10b17n06:142082] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  getting job size failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_ess_init failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[gr10b17n08:143134] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: gr10b17n07: task 2: Exited with exit code 1
srun: error: gr10b17n06: task 1: Exited with exit code 1
srun: error: gr10b17n08: task 3: Exited with exit code 1
c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes >

Slurm information:

c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > srun --mpi=list
srun: MPI types are...
srun: pmix_v3
srun: none
srun: pmi2
srun: pmix

The compilation lines used for PMIx and openmpi:

MAKEFLAGS="-j24 V=99" rpmbuild -ba --define 'install_in_opt 0' --define "configure_options --enable-shared --enable-static --with-jansson=/usr --with-libevent=/usr --with-libevent-libdir=/usr/lib64 --with-hwloc=/usr --with-curl=/usr --without-opamgt --with-munge=/usr --with-lustre=/usr --enable-pmix-timing --enable-pmi-backward-compatibility --enable-pmix-binaries --with-devel-headers --with-tests-examples --disable-mca-dso --disable-weak-symbols AR=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xiar LD=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xild CC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc FC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort F90=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort F77=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort CXX=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icpc LDFLAGS='-Wc,-static-intel -static-intel' CFLAGS=-O3 FCFLAGS=-O3 F77FLAGS=-O3 F90FLAGS=-O3 CXXFLAGS=-O3 MFLAGS='-j24 V99'" pmix-3.1.5.spec

MAKEFLAGS="-j24 V=99" rpmbuild -ba --define 'install_in_opt 0' --define "configure_options --enable-shared --enable-static --with-libevent=/usr --with-libevent-libdir=/usr/lib64 --with-pmix=/usr --with-pmix-libdir=/usr/lib64 --enable-install-libpmix --with-ompi-pmix-rte --without-orte --with-slurm --with-ucx=/usr --with-cuda=/usr/local/cuda --with-gdrcopy=/usr --with-hwloc --enable-mpi-cxx --disable-mca-dso --enable-mpi-fortran --disable-weak-symbols --enable-mpi-thread-multiple --enable-contrib-no-build=vt --enable-mpirun-prefix-by-default --enable-orterun-prefix-by-default --with-cuda=/usr/local/cuda AR=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xiar LD=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xild CC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc FC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort F90=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort F77=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort CXX=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icpc LDFLAGS='-Wc,-static-intel -static-intel' CFLAGS=-O3 FCFLAGS=-O3 F77FLAGS=-O3 F90FLAGS=-O3 CXXFLAGS=-O3 MFLAGS='-j24 V99'" openmpi-4.0.2.spec 2>&1 | tee /root/openmpi-2.log

---
Leandro
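One additional note on the srun failures in the quoted mail, since the `srun --mpi=list` output shows a pmix_v3 plugin: a quick sanity check is to confirm that the installed Open MPI really contains a PMIx component and that Slurm is told to launch through PMIx rather than its default. This is just a sketch under the assumption that the Open MPI you are running is the one linked against /usr/lib64/libpmix.so shown in the link line above:

# Does the installed Open MPI report any pmix components?
ompi_info | grep -i pmix

# Launch explicitly through Slurm's PMIx plugin instead of relying on the
# MpiDefault setting in slurm.conf (which may be "none"):
srun --mpi=pmix_v3 -N4 ./mpihello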