Artem - do you have any suggestions?

On Aug 8, 2019, at 12:06 PM, Jing Gong <gongj...@kth.se 
<mailto:gongj...@kth.se> > wrote:

Hi Ralph,

> Did you remember to add "--mpi=pmix" to your srun cmd line?

On the cluster,

$ srun  --mpi=list
srun: MPI types are...
srun: none
srun: openmpi
srun: pmi2
srun: pmix
srun: pmix_v1

I have tested srun with --mpi=pmi2, --mpi=pmix, and --mpi=pmix_v1, but none of them ran successfully.

Thanks. /Jing


--------------------------------
From: Ralph Castain <r...@open-mpi.org <mailto:r...@open-mpi.org> >
Sent: Thursday, August 8, 2019 21:01
To: Jing Gong
Subject: Re: [OMPI users] OMPI was not built with SLURM's PMI support
 Did you remember to add "--mpi=pmix" to your srun cmd line?


Hi Ralph,

Slurm seems to be configured with PMIx.

$ ls /usr/lib64/slurm/ |grep pmi
acct_gather_energy_ipmi.so
mpi_pmi2.so
mpi_pmix.so
mpi_pmix_v1.so

(and libpmix* in /usr/lib64)
 
Anyway, I recompiled openmpi v3.0.0 with

$ ./configure --with-pmix=/usr --with-slurm ...

but this time I could not even run "mpirun":

$ mpirun -n 4 ./a.out 
 [[42812,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-3.0.0/orte/mca/ess/hnp/ess_hnp_module.c at line 649
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

What is the issue?

Thanks a lot.

/Jing

Did you configure Slurm to use PMIx? If so, then you simply need to set the 
"--mpi=pmix" or "--mpi=pmix_v2" (depending on which version of PMIx you used) 
flag on your srun cmd line so it knows to use it.

If not (and you can't fix it), then you have to explicitly configure OMPI to 
use Slurm's legacy PMI libraries - we won't do that by default. "./configure 
--help" will show you what needs to be done.

See https://slurm.schedmd.com/mpi_guide.html for assistance on checking your 
Slurm config and setting it up with PMIx support.

Ralph
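
As a rough sketch of the check described above: the helper below filters
"srun --mpi=list" output down to the PMIx plugin names, so you can see which
value to pass to --mpi. The function name list_pmix_plugins is made up for
illustration and is not part of Slurm; the input format is assumed to match
the "srun: <name>" lines shown elsewhere in this thread.

```shell
# list_pmix_plugins: hypothetical helper (not part of Slurm).
# Reads `srun --mpi=list` style output on stdin and prints the PMIx
# plugin names, one per line.
list_pmix_plugins() {
    # Plugin lines look like "srun: pmix"; keep the second field when it
    # starts with "pmix". The "srun: MPI types are..." header is skipped
    # because its second field is "MPI".
    awk '$1 == "srun:" && $2 ~ /^pmix/ { print $2 }'
}

# On a cluster (assumes srun is on PATH; 2>&1 is used in case the list
# is printed on stderr):
#   srun --mpi=list 2>&1 | list_pmix_plugins
#   srun --mpi=pmix -n 4 ./a.out
```

Which of the listed names is the right one depends on the PMIx version Slurm
was built against, as noted above.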


On Aug 8, 2019, at 7:25 AM, Jing Gong via users <users@lists.open-mpi.org 
<mailto:users@lists.open-mpi.org> > wrote:

Hi,

Recently our Slurm system has been upgraded to 19.0.5. I tried to recompile 
openmpi v3.0 due to the bug reported in

https://bugs.schedmd.com/show_bug.cgi?id=6993

The configure flags are:

$ ./configure --enable-shared --enable-static --with-slurm --with-pmix

and the output of ompi_info is as follows:

$ ompi_info -a |grep pmix
  Configure command line: '--enable-shared' '--enable-static' '--with-slurm' '--with-pmix'
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v3.0.0)
                MCA pmix: pmix2x (MCA v2.1.0, API v2.0.0, Component v3.0.0)
           MCA pmix base: ---------------------------------------------------
           MCA pmix base: parameter "pmix" (current value: "", data source: default, level: 2 user/detail, type: string)
                          Default selection set of components for the pmix framework (<none> means use all components that can be found)
           MCA pmix base: ---------------------------------------------------
           MCA pmix base: parameter "pmix_base_verbose" (current value: "error", data source: default, level: 8 dev/detail, type: int)
                          Verbosity level for the pmix framework (default: 0)
           MCA pmix base: parameter "pmix_base_async_modex" (current value: "false", data source: default, level: 9 dev/all, type: bool)
           MCA pmix base: parameter "pmix_base_collect_data" (current value: "true", data source: default, level: 9 dev/all, type: bool)
           MCA pmix base: parameter "pmix_base_exchange_timeout" (current value: "-1", data source: default, level: 3 user/all, type: int)
         MCA pmix pmix2x: ---------------------------------------------------
         MCA pmix pmix2x: parameter "pmix_pmix2x_silence_warning" (current value: "false", data source: default, level: 4 tuner/basic, type: bool)

But when I launch the Open MPI program with srun, I get an error like this:

====
$ srun -n 4 ./a.out

--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
Local abort before MPI_INIT completed completed successfully, but am not able 
to aggregate error messages, and not able to guarantee that all other processes 
were killed!
===

How can I check whether Open MPI was built with PMI support?

Thanks a lot. /Jing 
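
One possible way to approach this check, sketched under assumptions: the
component lines of the form "MCA pmix: <name> (...)" in the ompi_info output
name the components of the pmix framework that were compiled into this build.
The helper below pulls out just those names. list_ompi_pmix_components is a
hypothetical name, not an Open MPI tool, and the line format is assumed to
match the ompi_info output quoted in this message.

```shell
# list_ompi_pmix_components: hypothetical helper (not part of Open MPI).
# Reads `ompi_info` output on stdin and prints the name of each component
# in the pmix framework, one per line.
list_ompi_pmix_components() {
    # Component lines look like "MCA pmix: pmix2x (MCA v2.1.0, ...)".
    # Parameter lines start with "MCA pmix base:" or "MCA pmix pmix2x:"
    # and are skipped, because their second field is "pmix" without the
    # trailing colon.
    awk '$1 == "MCA" && $2 == "pmix:" { print $3 }'
}

# Usage (assumes ompi_info from the build under test is on PATH):
#   ompi_info | list_ompi_pmix_components
```

For the build shown in this message, this would print "isolated" and
"pmix2x", i.e. only the components listed in the output above.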




_______________________________________________
users mailing list
users@lists.open-mpi.org <mailto:users@lists.open-mpi.org> 
https://lists.open-mpi.org/mailman/listinfo/users