Hello Jeff,
As an experiment, why not try
docker run /usr/local/mpi/bin/orted
and report the results?
Also, you may want to add --debug-daemons to the mpirun command line as another
experiment.
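
The two experiments above could be wrapped roughly as follows. This is only a sketch: the image name is not given in the thread, so it is passed in as an argument, and check_binary is a hypothetical helper (not part of Open MPI or Docker) for comparing what the container and the launch node each see at the orted path.

```shell
#!/bin/sh
# Hypothetical helper: report whether a path exists and is executable in the
# environment where this function runs (container, host, or remote node).
check_binary() {
    if [ -x "$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Experiment 1: start the daemon binary directly in a fresh container.
# $1 is the image name, which the thread does not give.
try_orted() {
    docker run --rm "$1" /usr/local/mpi/bin/orted
}

# Experiment 2: rerun the original launch with daemon debugging enabled.
# All arguments are passed through to mpirun unchanged.
debug_mpirun() {
    /usr/local/mpi/bin/mpirun --debug-daemons "$@"
}
```

For example, try_orted could be called with the real image name, and debug_mpirun with the same arguments as the original mpirun command. Running check_binary /usr/local/mpi/bin/orted both inside the container and in an ssh session on the target node may show whether the two environments agree about where orted lives.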
Howard
From: users on behalf of Jeffrey Layton via users
Reply-To: Open MPI Users
Date: Friday, September 27, 2024 at 1:08 PM
To: Open MPI Users
Cc: Jeffrey Layton
Subject: [EXTERNAL] [OMPI users] Issue with mpirun inside a container
Good afternoon,
I'm getting an error message when I run "mpirun ... " inside a Docker
container. The message:
bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--
From googling, I know this is a fairly common error message. BTW - great error
message with good suggestions.
The actual mpirun command is:
/usr/local/mpi/bin/mpirun -np $NP -H $JJ --allow-run-as-root -bind-to none \
--map-by slot \
python3 $BIN --checkpoint=$CHECKPOINT --model=$MODEL --nepochs=$NEPOCHS \
--fsdir=$FSDIR
This is called as the command for a "docker run ..." command.
I've tried a couple of things, such as passing a multi-command string to "docker
run ..." that sets $PATH and $LD_LIBRARY_PATH, and I get the same message. BTW -
orted is located exactly where the error message indicated. I've also tried not
using the fully qualified path for mpirun and just using "mpirun". I get the
same error message.
I can run this "by hand" after starting the Docker container. I just run the
container "docker run ..." but without the mpirun command, and then I run a
simple script that defines the env variables and ends with the mpirun command;
this works correctly. But using Slurm or using ssh directly to a node causes
the above error message.
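
A wrapper of the kind described above (define the environment, then end with mpirun) might look something like this. It is a sketch under assumptions rather than the actual script: the prefix /usr/local/mpi is taken from the orted path in the error message, the lib subdirectory is assumed, and prepend_dir and launch are made-up names.

```shell
#!/bin/sh
# Assumed install prefix, taken from the orted path in the error message.
MPI_PREFIX="/usr/local/mpi"

# Prepend a directory to a colon-separated list unless it is already present.
prepend_dir() {
    dir="$1"
    list="$2"
    case ":$list:" in
        *":$dir:"*)
            printf '%s\n' "$list"
            ;;
        *)
            if [ -n "$list" ]; then
                printf '%s\n' "$dir:$list"
            else
                printf '%s\n' "$dir"
            fi
            ;;
    esac
}

# Set up the environment, then hand off to mpirun with the caller's arguments.
# The lib directory under the prefix is an assumption about this build.
launch() {
    PATH="$(prepend_dir "$MPI_PREFIX/bin" "$PATH")"
    LD_LIBRARY_PATH="$(prepend_dir "$MPI_PREFIX/lib" "${LD_LIBRARY_PATH:-}")"
    export PATH LD_LIBRARY_PATH
    exec "$MPI_PREFIX/bin/mpirun" "$@"
}
```

The script would end with a call such as launch -np "$NP" -H "$JJ" followed by the remaining arguments from the original command. Note that these exports only affect the shell that runs the wrapper; whether they reach daemons started on other nodes depends on how those daemons are launched.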
BTW - someone else built this container with Open MPI and I can't really change
it (I thought about rebuilding Open MPI in the container but I don't know the
details of how it was built).
Any thoughts?
Thanks!
Jeff