I have been having some difficulty getting the right combination of
SLURM, PMIx, and OMPI 3.1.x (specifically 3.1.2) to compile in such a way
that both the srun method of starting jobs and mpirun/mpiexec work.

If someone has a combination of SLURM 18.08 or newer, PMIx, and OMPI 3.x
that works with both srun and mpirun, and wouldn't mind sending me the
version numbers and any tips for getting it to work, I would appreciate it.

Should mpirun still work?  If that is just off the table and I missed the
memo, please let me know.

I'm asking about both because of programs like OpenFOAM and others where
mpirun is built into the application.  I have OMPI 1.10.7 built with
similar flags, and it seems to work with both:

[bennet@beta-build mpi_example]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is:  0.000458

[bennet@beta-build mpi_example]$ mpirun ./test_mpi
The sum = 0.866386
Elapsed time is:  0.000295

The SLURM documentation doesn't seem to list a recommended PMIx version,
as far as I can find, and I can't find where the version of PMIx that is
bundled with OMPI is specified, either.
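
For what it's worth, the only places I have found to look are the source
tree and the installed headers; the paths below are my best guess and may
not be quite right:

# Embedded PMIx shipped inside the OMPI 3.1.x tarball (path is a guess)
grep -E '^(major|minor|release)=' openmpi-3.1.2/opal/mca/pmix/pmix2x/pmix/VERSION
# Version of the external PMIx that --with-pmix points at
grep PMIX_VERSION /opt/pmix/2.0.2/include/pmix_version.h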

I have SLURM 18.08.0, which is built against pmix-2.0.2.  We settled on
that version back when we were running SLURM 17.something, before SLURM
supported PMIx 2.1.  Is OMPI 3.1.2 balking at too old a PMIx?
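
If it would help with diagnosis, the only SLURM-side checks I know of are
the ones below; the plugin name is an assumption on my part, since it
presumably depends on how SLURM was configured:

# List the MPI/PMI plugins this SLURM build actually provides
srun --mpi=list
# Request the PMIx plugin explicitly instead of relying on the default
srun --mpi=pmix_v2 ./test_mpi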

Sorry to be so at sea.

I built OMPI with

./configure \
    --prefix=${PREFIX} \
    --mandir=${PREFIX}/share/man \
    --with-pmix=/opt/pmix/2.0.2 \
    --with-libevent=external \
    --with-hwloc=external \
    --with-slurm \
    --with-verbs \
    --disable-dlopen --enable-shared \
    CC=gcc CXX=g++ FC=gfortran
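
After installing, this is how I have been trying to confirm which PMIx
support actually got built in (I am assuming ompi_info and the configure
log are the right places to look):

# Show the PMIx-related MCA components in this OMPI install
ompi_info | grep -i pmix
# See what configure decided about the external PMIx
grep -i pmix config.log | head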

I have a simple test program, and it runs fine with srun:

[bennet@beta-build mpi_example]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is:  0.000573

but on a login node, where I just want a few processors on the local node
rather than on the compute nodes of the cluster, mpirun fails with

[bennet@beta-build mpi_example]$ mpirun -np 2 ./test_mpi
[beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG:
Not found in file base/ess_base_std_app.c at line 219
[beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG:
Not found in file base/ess_base_std_app.c at line 219
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  store DAEMON URI failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG:
Not found in file ess_pmi_module.c at line 401
[beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG:
Not found in file ess_pmi_module.c at line 401
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[beta-build.stage.arc-ts.umich.edu:102541] Local abort before MPI_INIT
completed completed successfully, but am not able to aggregate error
messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[beta-build.stage.arc-ts.umich.edu:102542] Local abort before MPI_INIT
completed completed successfully, but am not able to aggregate error
messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[13610,1],0]
  Exit code:    1
--------------------------------------------------------------------------
[beta-build.stage.arc-ts.umich.edu:102536] 3 more processes have sent help
message help-orte-runtime.txt / orte_init:startup:internal-failure
[beta-build.stage.arc-ts.umich.edu:102536] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages
[beta-build.stage.arc-ts.umich.edu:102536] 1 more process has sent help
message help-orte-runtime / orte_init:startup:internal-failure
[beta-build.stage.arc-ts.umich.edu:102536] 1 more process has sent help
message help-mpi-runtime.txt / mpi_init:startup:internal-failure
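
If more detail would help, I can re-run with extra verbosity; I am assuming
ess_base_verbose and pmix_base_verbose are the relevant knobs here:

# Re-run the failing case with verbose output from the ESS and PMIx frameworks
mpirun --mca ess_base_verbose 10 --mca pmix_base_verbose 10 -np 2 ./test_mpi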