Good afternoon,

I'm getting an error message when I run "mpirun ... " inside a Docker
container. The message:

bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

From searching online, I know this is a fairly common error message. (BTW,
it's a great error message with good suggestions.)

The actual mpirun command is:

/usr/local/mpi/bin/mpirun -np $NP -H $JJ --allow-run-as-root \
    -bind-to none --map-by slot \
    python3 $BIN --checkpoint=$CHECKPOINT --model=$MODEL --nepochs=$NEPOCHS

This is passed as the command to "docker run ...".

I've tried a couple of things, such as passing a multi-command string to
"docker run ..." that sets $PATH and $LD_LIBRARY_PATH first, but I get the
same message. BTW - orted is located exactly where the error message
indicates. I've also tried dropping the fully qualified path and invoking
plain "mpirun"; I get the same error message.
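
A check along these lines, run from a shell inside the container, is one way
to confirm that the binary is present and executable. This is only a sketch:
check_binary is an illustrative helper name, not something shipped in the
image.

```shell
#!/bin/bash
# check_binary is a hypothetical helper, named here only for illustration.
# It reports whether the given path exists and is executable.
check_binary() {
    local path="$1"
    if [ -x "$path" ]; then
        echo "found: $path"
        return 0
    fi
    echo "missing or not executable: $path" >&2
    return 1
}
```

Inside the container, "check_binary /usr/local/mpi/bin/orted" reports the
binary as found.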

I can run this "by hand" after starting the Docker container. I start the
container with "docker run ..." but without the mpirun command, and then
run a simple script inside it that defines the env variables and ends with
the mpirun command; this works correctly. But launching through Slurm or
over ssh directly to a node produces the above error message.
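
The script that works by hand has roughly the following shape. This is a
sketch rather than the exact file: the function names are illustrative, and
the lib directory under /usr/local/mpi is an assumption based on where the
binaries live. $NP, $JJ, $BIN, $CHECKPOINT, $MODEL and $NEPOCHS are expected
to be set before it runs.

```shell
#!/bin/bash
# Sketch of the wrapper script; setup_mpi_env and run_job are illustrative names.
# Assumption: Open MPI is installed under /usr/local/mpi with a lib/ directory.
MPI_PREFIX="${MPI_PREFIX:-/usr/local/mpi}"

# Put the Open MPI binaries and libraries first on the search paths.
setup_mpi_env() {
    export PATH="$MPI_PREFIX/bin:$PATH"
    export LD_LIBRARY_PATH="$MPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Same mpirun invocation as above, with the variables quoted.
run_job() {
    "$MPI_PREFIX/bin/mpirun" -np "$NP" -H "$JJ" --allow-run-as-root \
        -bind-to none --map-by slot \
        python3 "$BIN" --checkpoint="$CHECKPOINT" --model="$MODEL" --nepochs="$NEPOCHS"
}

setup_mpi_env
# The real script ends by launching the job; the call is commented out here
# so that the sketch can be sourced without starting anything.
# run_job
```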

BTW - someone else built this container with Open MPI and I can't really
change it (I thought about rebuilding Open MPI in the container but I don't
know the details of how it was built).

Any thoughts?



