Artem - do you have any suggestions?
On Aug 8, 2019, at 12:06 PM, Jing Gong <gongj...@kth.se> wrote:

Hi Ralph,

> Did you remember to add "--mpi=pmix" to your srun cmd line?

On the cluster,

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: openmpi
srun: pmi2
srun: pmix
srun: pmix_v1

I have tested srun with --mpi=pmi2, --mpi=pmix, and --mpi=pmix_v1, but none of them ran successfully.

Thanks. /Jing

--------------------------------
From: Ralph Castain <r...@open-mpi.org>
Sent: Thursday, August 8, 2019 21:01
To: Jing Gong
Subject: Re: [OMPI users] OMPI was not built with SLURM's PMI support

Did you remember to add "--mpi=pmix" to your srun cmd line?

Hi Ralph,

Slurm seems to be configured with pmix:

$ ls /usr/lib64/slurm/ | grep pmi
acct_gather_energy_ipmi.so
mpi_pmi2.so
mpi_pmix.so
mpi_pmix_v1.so

(and libpmix* in /usr/lib64)

Anyway, I recompiled openmpi v3.0.0 with

$ ./configure --with-pmix=/usr --with-slurm ...

but this time I could not even run "mpirun":

$ mpirun -n 4 ./a.out
[[42812,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-3.0.0/orte/mca/ess/hnp/ess_hnp_module.c at line 649
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

What is the issue? Thanks a lot. /Jing

Did you configure Slurm to use PMIx? If so, then you simply need to set the "--mpi=pmix" or "--mpi=pmix_v2" (depending on which version of PMIx you used) flag on your srun cmd line so it knows to use it.

If not (and you can't fix it), then you have to explicitly configure OMPI to use Slurm's legacy PMI libraries - we won't do that by default. "./configure --help" will show you what needs to be done.

See https://slurm.schedmd.com/mpi_guide.html for assistance on checking your Slurm config and setting it up with PMIx support.

Ralph
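For reference, the two paths Ralph describes above would typically look something like the sketch below; the "/usr" prefix is only an assumption about where this particular system installs Slurm's PMI/PMIx libraries.

# Option 1: Slurm itself was built with PMIx -- tell srun which plugin to use
$ srun --mpi=pmix -n 4 ./a.out

# Option 2: rebuild Open MPI against Slurm's legacy PMI-1/PMI-2 libraries,
#           then direct-launch with the matching plugin
$ ./configure --with-slurm --with-pmi=/usr ...
$ srun --mpi=pmi2 -n 4 ./a.out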
On Aug 8, 2019, at 7:25 AM, Jing Gong via users <users@lists.open-mpi.org> wrote:

Hi,

Recently our Slurm system has been upgraded to 19.0.5. I tried to recompile openmpi v3.0 due to the bug reported in https://bugs.schedmd.com/show_bug.cgi?id=6993

The configure flags are:

$ ./configure --enable-shared --enable-static --with-slurm --with-pmix

and the output of ompi_info is as follows:

$ ompi_info -a | grep pmix
  Configure command line: '--enable-shared' '--enable-static' '--with-slurm' '--with-pmix'
           MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v3.0.0)
           MCA pmix: pmix2x (MCA v2.1.0, API v2.0.0, Component v3.0.0)
      MCA pmix base: ---------------------------------------------------
      MCA pmix base: parameter "pmix" (current value: "", data source: default, level: 2 user/detail, type: string)
                     Default selection set of components for the pmix framework (<none> means use all components that can be found)
      MCA pmix base: ---------------------------------------------------
      MCA pmix base: parameter "pmix_base_verbose" (current value: "error", data source: default, level: 8 dev/detail, type: int)
                     Verbosity level for the pmix framework (default: 0)
      MCA pmix base: parameter "pmix_base_async_modex" (current value: "false", data source: default, level: 9 dev/all, type: bool)
      MCA pmix base: parameter "pmix_base_collect_data" (current value: "true", data source: default, level: 9 dev/all, type: bool)
      MCA pmix base: parameter "pmix_base_exchange_timeout" (current value: "-1", data source: default, level: 3 user/all, type: int)
    MCA pmix pmix2x: ---------------------------------------------------
    MCA pmix pmix2x: parameter "pmix_pmix2x_silence_warning" (current value: "false", data source: default, level: 4 tuner/basic, type: bool)

But when I launch the program with srun, I get an error like:

====
$ srun -n 4 ./a.out
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
Local abort before MPI_INIT completed completed successfully, but am not
able to aggregate error messages, and not able to guarantee that all
other processes were killed!
====

How can I check whether Open MPI was built with PMI support? Thanks a lot.

/Jing
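To answer that last question directly: a quick check, assuming the pmix component names I recall for the Open MPI 3.x series, is to look at which components ompi_info reports:

$ ompi_info | grep "MCA pmix"

A build configured against Slurm's legacy PMI libraries (--with-pmi) should additionally list the "s1" (PMI-1) and/or "s2" (PMI-2) components; the ompi_info output quoted above shows only "isolated" and "pmix2x", which is consistent with the "not built with SLURM's PMI support" error from srun.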
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users