Try adding --without-psm2 to the PMIx configure line - it sounds like you have 
that library installed on your machine, even though you don't have Omni-Path.
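For example, a quick way to check whether the PSM2 library is present on the 
build host, plus a sketch of the adjusted PMIx build (your rpmbuild line from 
below with the one extra flag; placeholder in angle brackets):

ldconfig -p | grep psm2    # lists libpsm2 if the PSM2 runtime is installed
rpm -q libpsm2             # likely package name on CentOS/RHEL, if it was installed as an RPM

# sketch only: same PMIx rpmbuild invocation as below, with --without-psm2 added
MAKEFLAGS="-j24 V=99" rpmbuild -ba --define 'install_in_opt 0' \
  --define "configure_options --without-psm2 <rest of your original options>" \
  pmix-3.1.5.spec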


On May 12, 2020, at 4:42 AM, Leandro via users <users@lists.open-mpi.org> wrote:

Hi, 

I compile it statically to make sure the compiler's libraries will not be a 
runtime dependency, and I have done it this way for years. The developers said 
they wanted it this way, so I did.

I saw this warning, and that library is related to Omni-Path, which we don't 
have.
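A simple check on one of the nodes, for what it's worth:

lspci | grep -i omni    # expected to print nothing here, since these machines have no Omni-Path HFI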

---
Leandro


On Tue, May 12, 2020 at 8:27 AM Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
It looks like you are building both static and dynamic libraries 
(--enable-static and --enable-shared).  This might be confusing the issue -- I 
can see at least one warning:

icc: warning #10237: -lcilkrts linked in dynamically, static library not 
available

It's not easy to tell from the snippets you sent what other downstream side 
effects this might have.

Is there a reason to compile statically?  It generally leads to (much) bigger 
executables and far worse memory efficiency (i.e., the library is not shared in 
memory between all the MPI processes running on each node).  Also, the link 
phase of compilers tends to prefer shared libraries, so unless your apps are 
compiled/linked with the compiler's "link this statically" flags, they will 
likely default to using the shared libraries anyway.
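A quick sanity check (illustrative only, assuming your test binary is the 
mpihello you mention below) is to see which library the executable actually 
resolved:

ldd ./mpihello | grep -i libmpi    # a libmpi.so.* line here means the shared Open MPI library is in use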

This is a long way of saying: try building everything with just --enable-shared 
(and not --enable-static).  Or possibly just remove both flags; --enable-shared 
is the default.
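Concretely, that just means removing --enable-static from the configure_options 
in your rpmbuild line, e.g. (sketch only; everything else unchanged):

MAKEFLAGS="-j24 V=99" rpmbuild -ba --define 'install_in_opt 0' \
  --define "configure_options --enable-shared <all other options as before, minus --enable-static>" \
  openmpi-4.0.2.spec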




On May 11, 2020, at 9:23 AM, Leandro via users <users@lists.open-mpi.org> wrote:

Hi,

I'm trying to start using Slurm. I followed all the instructions to build PMIx 
and Slurm with PMIx support, but I can't get Open MPI to work.

According to the PMIx documentation, I should compile Open MPI with 
"--with-ompi-pmix-rte", but when I try it, the build fails. I need to build 
this as CentOS RPMs.

Thanks in advance for your help. I pasted some info below.

libtool: link: 
/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc
 -std=gnu99 -std=gnu99 -DOPAL_CONFIGURE_USER=\"root\" 
-DOPAL_CONFIGURE_HOST=\"gr10b17n05\" "-DOPAL_CONFIGURE_DATE=\"Fri May  8 
13:35:51 -03 2020\"" -DOMPI_BUILD_USER=\"root\" 
-DOMPI_BUILD_HOST=\"gr10b17n05\" "-DOMPI_BUILD_DATE=\"Fri May  8 13:47:32 -03 
2020\"" "-DOMPI_BUILD_CFLAGS=\"-DNDEBUG -O3 -finline-functions 
-fno-strict-aliasing -restrict -Qoption,cpp,--extended_float_types -pthread\"" 
"-DOMPI_BUILD_CPPFLAGS=\"-I../../.. -I../../../orte/include    \"" 
"-DOMPI_BUILD_CXXFLAGS=\"-DNDEBUG -O3 -finline-functions -pthread\"" 
"-DOMPI_BUILD_CXXCPPFLAGS=\"-I../../..  \"" "-DOMPI_BUILD_FFLAGS=\"-O2 -g -pipe 
-Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong 
--param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic 
-I/usr/lib64/gfortran/modules\"" -DOMPI_BUILD_FCFLAGS=\"-O3\" 
"-DOMPI_BUILD_LDFLAGS=\"-Wc,-static-intel -static-intel    -L/usr/lib64\"" 
"-DOMPI_BUILD_LIBS=\"-lrt -lutil  -lz  -lhwloc  -levent -levent_pthreads\"" 
-DOPAL_CC_ABSOLUTE=\"\" -DOMPI_CXX_ABSOLUTE=\"none\" -DNDEBUG -O3 
-finline-functions -fno-strict-aliasing -restrict 
-Qoption,cpp,--extended_float_types -pthread -static-intel -static-intel -o 
.libs/ompi_info ompi_info.o param.o  -L/usr/lib64 ../../../ompi/.libs/libmpi.so 
-L/usr/lib -llustreapi 
/root/rpmbuild/BUILD/openmpi-4.0.2/opal/.libs/libopen-pal.so 
../../../opal/.libs/libopen-pal.so -lfabric -lucp -lucm -lucs -luct -lrdmacm 
-libverbs /usr/lib64/libpmix.so -lmunge -lrt -lutil -lz /usr/lib64/libhwloc.so 
-lm -ludev -lltdl -levent -levent_pthreads -pthread -Wl,-rpath -Wl,/usr/lib64
icc: warning #10237: -lcilkrts linked in dynamically, static library not 
available
../../../ompi/.libs/libmpi.so: undefined reference to `orte_process_info'
../../../ompi/.libs/libmpi.so: undefined reference to `orte_show_help'

make[2]: *** [ompi_info] Error 1
make[2]: Leaving directory 
`/root/rpmbuild/BUILD/openmpi-4.0.2/ompi/tools/ompi_info'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/openmpi-4.0.2/ompi'
make: *** [all-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.RyklCR (%build)

The ORTE libraries are missing. When I don't use "--with-ompi-pmix-rte" it 
builds, but neither mpirun nor srun works:

c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > cat machine_file 
gr10b17n05
gr10b17n06
gr10b17n07
gr10b17n08
c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun -machinefile 
machine_file ./mpihello 
[gr10b17n07:115065] [[21391,0],2] ORTE_ERROR_LOG: Not found in file 
base/ess_base_std_orted.c at line 362
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
[gr10b17n08:142030] [[21391,0],3] ORTE_ERROR_LOG: Not found in file 
base/ess_base_std_orted.c at line 362
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   gr10b17n05
  target node:  gr10b17n06

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[gr10b17n05:171586] 1 more process has sent help message help-errmgr-base.txt / 
no-path
[gr10b17n05:171586] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
all help / error messages
c315@gr10b17n05 /bw1nfs1/Projetos1/c315/Meus_testes > 

--------------------------
c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun --nolocal -np 1 
--machinefile machine_file mpihello 
[gr10pbs2:242828] [[60566,0],0] ORTE_ERROR_LOG: Not found in file 
ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > mpirun --nolocal -np 1 
--machinefile machine_file mpihello 
[gr10pbs2:237314] [[50968,0],0] ORTE_ERROR_LOG: Not found in file 
ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > 

c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > srun -N4 
/bw1nfs1/Projetos1/c315/Meus_testes/mpihello
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[gr10b17n05:172693] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
srun: error: gr10b17n05: task 0: Exited with exit code 1
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting job size failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting job size failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[gr10b17n07:116175] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[gr10b17n06:142082] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting job size failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[gr10b17n08:143134] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
srun: error: gr10b17n07: task 2: Exited with exit code 1
srun: error: gr10b17n06: task 1: Exited with exit code 1
srun: error: gr10b17n08: task 3: Exited with exit code 1
c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > 

Slurm information:
c315@gr10pbs2 /bw1nfs1/Projetos1/c315/Meus_testes > srun --mpi=list
srun: MPI types are...
srun: pmix_v3
srun: none
srun: pmi2
srun: pmix
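For reference, the srun run above did not pass --mpi explicitly; the PMIx plugin 
can be forced with, for example:

srun --mpi=pmix_v3 -N4 /bw1nfs1/Projetos1/c315/Meus_testes/mpihello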

The rpmbuild command lines used for PMIx and Open MPI:

MAKEFLAGS="-j24 V=99" rpmbuild -ba --define 'install_in_opt 0' --define
"configure_options --enable-shared --enable-static --with-jansson=/usr
--with-libevent=/usr --with-libevent-libdir=/usr/lib64 --with-hwloc=/usr
--with-curl=/usr --without-opamgt --with-munge=/usr --with-lustre=/usr
--enable-pmix-timing --enable-pmi-backward-compatibility --enable-pmix-binaries
--with-devel-headers --with-tests-examples --disable-mca-dso --disable-weak-symbols
AR=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xiar
LD=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xild
CC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc
FC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
F90=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
F77=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
CXX=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icpc
LDFLAGS='-Wc,-static-intel -static-intel' CFLAGS=-O3 FCFLAGS=-O3 F77FLAGS=-O3
F90FLAGS=-O3 CXXFLAGS=-O3 MFLAGS='-j24 V99'" pmix-3.1.5.spec

MAKEFLAGS="-j24 V=99" rpmbuild -ba --define 'install_in_opt 0' --define
"configure_options --enable-shared --enable-static --with-libevent=/usr
--with-libevent-libdir=/usr/lib64 --with-pmix=/usr --with-pmix-libdir=/usr/lib64
--enable-install-libpmix --with-ompi-pmix-rte --without-orte --with-slurm
--with-ucx=/usr --with-cuda=/usr/local/cuda --with-gdrcopy=/usr --with-hwloc
--enable-mpi-cxx --disable-mca-dso --enable-mpi-fortran --disable-weak-symbols
--enable-mpi-thread-multiple --enable-contrib-no-build=vt
--enable-mpirun-prefix-by-default --enable-orterun-prefix-by-default
--with-cuda=/usr/local/cuda
AR=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xiar
LD=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/xild
CC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icc
FC=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
F90=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
F77=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/ifort
CXX=/tgdesenv/dist/compiladores/intel/compilers_and_libraries_2019.5.281/linux/bin/intel64/icpc
LDFLAGS='-Wc,-static-intel -static-intel' CFLAGS=-O3 FCFLAGS=-O3 F77FLAGS=-O3
F90FLAGS=-O3 CXXFLAGS=-O3 MFLAGS='-j24 V99'" openmpi-4.0.2.spec 2>&1 | tee /root/openmpi-2.log
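Once a build completes, a quick way to see which PMIx component Open MPI was 
actually built with (the external /usr PMIx vs. the internal copy):

ompi_info | grep -i pmix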

---
Leandro


-- 
Jeff Squyres
jsquy...@cisco.com

