Good afternoon, I'm getting an error message when I run "mpirun ... " inside a Docker container. The message:

bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

From googling, I know this is a fairly common error message. BTW - great error message with good suggestions.

The actual mpirun command is:

/usr/local/mpi/bin/mpirun -np $NP -H $JJ --allow-run-as-root -bind-to none --map-by slot \
    python3 $BIN --checkpoint=$CHECKPOINT --model=$MODEL --nepochs=$NEPOCHS \
    --fsdir=$FSDIR

This is passed as the command to a "docker run ..." invocation.

I've tried a couple of things, such as giving "docker run ..." a multi-part command that first sets $PATH and $LD_LIBRARY_PATH, and I get the same message. BTW - orted is located exactly where the error message indicates. I've also tried dropping the fully qualified path to mpirun and just using "mpirun". I get the same error message.

I can run this "by hand" after starting the Docker container. I just start the container with "docker run ..." but without the mpirun command, and then I run a simple script that defines the env variables and ends with the mpirun command; this works correctly. But launching through Slurm, or using ssh directly to a node, causes the above error message.

BTW - someone else built this container with Open MPI and I can't really change it (I thought about rebuilding Open MPI in the container, but I don't know the details of how it was built).

Any thoughts?

Thanks!

Jeff