Hello Jeff,

As an experiment, why not try running

docker run  /usr/local/mpi/bin/orted

and report the results?

Also, you may want to add --debug-daemons to the mpirun command line as another 
experiment.
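
One way to make the orted check repeatable is a small shell helper. This is 
only a sketch: check_binary is a hypothetical name, not part of Open MPI, and 
it only tests that the path exists and is executable in whatever environment 
it is run in.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): report whether a binary
# is present and executable in the current environment.
check_binary() {
    path="$1"
    if [ ! -e "$path" ]; then
        # The path does not exist in this environment at all.
        echo "missing: $path"
        return 1
    fi
    if [ ! -x "$path" ]; then
        # The file exists but cannot be executed by the current user.
        echo "not executable: $path"
        return 2
    fi
    echo "ok: $path"
    return 0
}
```

Calling check_binary /usr/local/mpi/bin/orted both inside the container and in 
the shell that ssh or Slurm gives you on the node should show which of the two 
environments cannot see the file.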

Howard

From: users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey Layton via 
users <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Friday, September 27, 2024 at 1:08 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Jeffrey Layton <layto...@gmail.com>
Subject: [EXTERNAL] [OMPI users] Issue with mpirun inside a container

Good afternoon,

I'm getting an error message when I run "mpirun ... " inside a Docker 
container. The message:


bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------


From googling, I know this is a fairly common error message. BTW - great error 
message with good suggestions.


The actual mpirun command is:

/usr/local/mpi/bin/mpirun -np $NP -H $JJ --allow-run-as-root -bind-to none --map-by slot \
    python3 $BIN --checkpoint=$CHECKPOINT --model=$MODEL --nepochs=$NEPOCHS \
    --fsdir=$FSDIR


This is called as the command for a "docker run ..." command.

I've tried a couple of things, such as passing a multi-command string to "docker 
run ..." that sets $PATH and $LD_LIBRARY_PATH, and I get the same message. BTW - 
orted is located exactly where the error message indicates. I've also tried 
dropping the full path to mpirun and just using "mpirun". I get the same error 
message.

I can run this "by hand" after starting the Docker container. I just run the 
container "docker run ..." but without the mpirun command, and then I run a 
simple script that defines the env variables and ends with the mpirun command; 
this works correctly. But using Slurm or using ssh directly to a node causes 
the above error message.
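
For reference, the kind of wrapper script described above might look roughly 
like the sketch below. The names setup_mpi_env and run_job are hypothetical, 
and the lib directory under /usr/local/mpi is an assumption; only the bin path 
comes from the error message.

```shell
#!/bin/sh
# Sketch of a wrapper that sets the environment before launching mpirun.
# MPI_PREFIX defaults to the prefix seen in the error message; the lib
# subdirectory is an assumption about how this Open MPI was installed.
MPI_PREFIX="${MPI_PREFIX:-/usr/local/mpi}"

setup_mpi_env() {
    # Put the Open MPI binaries and libraries first in the search paths.
    PATH="$MPI_PREFIX/bin:$PATH"
    LD_LIBRARY_PATH="$MPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

run_job() {
    # Hypothetical entry point: set the environment, then replace this
    # shell with mpirun, passing all arguments through unchanged.
    setup_mpi_env
    exec mpirun "$@"
}
```

A script like this would end with a call such as run_job -np $NP -H $JJ ... so 
that the variables are set in the same shell that starts mpirun.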

BTW - someone else built this container with Open MPI and I can't really change 
it (I thought about rebuilding Open MPI in the container but I don't know the 
details of how it was built).

Any thoughts?

Thanks!

Jeff
