Gilles,

This was exactly it - thank you.

If I wanted to run the code in the container across multiple nodes, I would
need to do something like "mpirun ... 'docker run ...' "?

Thanks!

Jeff

On Mon, Sep 30, 2024 at 2:38 AM Gilles Gouaillardet via users
<users@lists.open-mpi.org> wrote:

> Jeffrey,
>
> You are invoking mpirun with the -H <hostfile> option, so basically mpirun
> inside your container will
> ssh ... orted ...
> but the remote orted will not run in a container, and hence the error
> message.
> Note it is possible you planned to run everything in the container, but
> for some reason Open MPI failed to figure
> out the name in the host file is the container, in this case, try without
> the -H option, or try using localhost in the host file.
>
> Cheers,
>
> Gilles
>
> On Mon, Sep 30, 2024 at 1:34 AM Jeffrey Layton via users
> <users@lists.open-mpi.org> wrote:
>
>> Howard,
>>
>> I tried the first experiment of using orted instead of mpirun. The output
>> is below.
>>
>>
>> /usr/local/mpi/bin/orted: Error: unknown option "-np"
>> Type '/usr/local/mpi/bin/orted --help' for usage.
>> Usage: /usr/local/mpi/bin/orted [OPTION]...
>> -d|--debug               Debug the OpenRTE
>> --daemonize              Daemonize the orted into the background
>> --debug-daemons          Enable debugging of OpenRTE daemons
>> --debug-daemons-file     Enable debugging of OpenRTE daemons, storing output
>>                          in files
>> -h|--help                This help message
>> --hnp                    Direct the orted to act as the HNP
>> --hnp-uri <arg0>         URI for the HNP
>> -nodes|--nodes <arg0>
>>                          Regular expression defining nodes in system
>> -output-filename|--output-filename <arg0>
>>                          Redirect output from application processes into
>>                          filename.rank
>> --parent-uri <arg0>      URI for the parent if tree launch is enabled.
>> -report-bindings|--report-bindings
>>                          Whether to report process bindings to stderr
>> --report-uri <arg0>      Report this process' uri on indicated pipe
>> -s|--spin                Have the orted spin until we can connect a debugger
>>                          to it
>> --set-sid                Direct the orted to separate from the current
>>                          session
>> --singleton-died-pipe <arg0>
>>                          Watch on indicated pipe for singleton termination
>> --test-suicide <arg0>
>>                          Suicide instead of clean abort after delay
>> --tmpdir <arg0>          Set the root for the session directory tree
>> -tree-spawn|--tree-spawn
>>                          Tree-based spawn in progress
>> -xterm|--xterm <arg0>
>>                          Create a new xterm window and display output from
>>                          the specified ranks there
>>
>> For additional mpirun arguments, run 'mpirun --help <category>'
>>
>> The following categories exist: general (Defaults to this option), debug,
>> output, input, mapping, ranking, binding, devel (arguments useful to OMPI
>> Developers), compatibility (arguments supported for backwards
>> compatibility), launch (arguments to modify launch options), and dvm
>> (Distributed Virtual Machine arguments).
>>
>>
>> Then I tried adding the debug flag you mentioned and I got the same error.
>>
>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>>
>>
>> I also tried a third experiment and tried using a container I have used
>> before. It has an older version of Open MPI but I get the same answer as I
>> get now,
>>
>>
>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>>
>>
>> This is sounding like a path problem but I'm not sure. Adding the path to
>> MPI in $PATH and $LD_LIBRARY_PATH didn't change the error message.
>>
>> Thanks!
>>
>> Jeff
>>
>>
>> ------------------------------
>> *From:* users <users-boun...@lists.open-mpi.org> on behalf of Pritchard
>> Jr., Howard via users <users@lists.open-mpi.org>
>> *Sent:* Friday, September 27, 2024 4:40 PM
>> *To:* Open MPI Users <users@lists.open-mpi.org>
>> *Cc:* Pritchard Jr., Howard (EXTERNAL) <howa...@lanl.gov>
>> *Subject:* Re: [OMPI users] [EXTERNAL] Issue with mpirun inside a
>> container
>>
>> *External email: Use caution opening links or attachments*
>>
>> Hello Jeff,
>>
>> As an experiment why not try
>>
>> docker run /usr/local/mpi/bin/orted
>>
>> ?
>>
>> and report the results?
>>
>> Also, you may want to add --debug-daemons to the mpirun command line as
>> another experiment.
>>
>> Howard
>>
>> *From: *users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey
>> Layton via users <users@lists.open-mpi.org>
>> *Reply-To: *Open MPI Users <users@lists.open-mpi.org>
>> *Date: *Friday, September 27, 2024 at 1:08 PM
>> *To: *Open MPI Users <users@lists.open-mpi.org>
>> *Cc: *Jeffrey Layton <layto...@gmail.com>
>> *Subject: *[EXTERNAL] [OMPI users] Issue with mpirun inside a container
>>
>> Good afternoon,
>>
>> I'm getting an error message when I run "mpirun ... " inside a Docker
>> container. The message:
>>
>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>>
>> * not finding the required libraries and/or binaries on
>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>
>> * lack of authority to execute on one or more specified nodes.
>>   Please verify your allocation and authorities.
>>
>> * the inability to write startup files into /tmp
>>   (--tmpdir/orte_tmpdir_base).
>>   Please check with your sys admin to determine the correct location to
>>   use.
>>
>> * compilation of the orted with dynamic libraries when static are required
>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>   one of the contrib/platform definitions for your system type.
>>
>> * an inability to create a connection back to mpirun due to a
>>   lack of common network interfaces and/or no route found between
>>   them. Please check network connectivity (including firewalls
>>   and network routing requirements).
>> --------------------------------------------------------------------------
>>
>> In googling I know this is a fairly common error message. BTW - great
>> error message with good suggestions.
>>
>> The actual mpirun command is:
>>
>> /usr/local/mpi/bin/mpirun -np $NP -H $JJ --allow-run-as-root -bind-to none --map-by slot \
>>     python3 $BIN --checkpoint=$CHECKPOINT --model=$MODEL --nepochs=$NEPOCHS \
>>     --fsdir=$FSDIR
>>
>> This is called as the command for a "docker run ..." command.
>>
>> I've tried a couple of things such as making a multiple command for
>> "docker run ..." that sets $PATH and $LD_LIBRARY_PATH and I get the same
>> message. BTW - orted is located exactly where the error message indicated.
>> I've tried not using the fully qualified path for mpirun and just using
>> "mpirun". I get the same error message.
>>
>> I can run this "by hand" after starting the Docker container. I just run
>> the container "docker run ..." but without the mpirun command, and then I
>> run a simple script that defines the env variables and ends with the mpirun
>> command; this works correctly. But using Slurm or using ssh directly to a
>> node causes the above error message.
>>
>> BTW - someone else built this container with Open MPI and I can't really
>> change it (I thought about rebuilding Open MPI in the container but I don't
>> know the details of how it was built).
>>
>> Any thoughts?
>>
>> Thanks!
>>
>> Jeff
>>
>
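
A minimal sketch of the two variants Gilles suggests at the top of the thread, applied to the mpirun command quoted above: drop -H entirely, or pass localhost so the launch stays inside the container. The helper name build_mpirun_cmd and the fallback values for NP and BIN are hypothetical, and the application arguments (--checkpoint, --model, and so on) are left out for brevity. The function only prints the command line; it does not run mpirun, so it can be used as a dry run before changing the real "docker run ..." command.

```shell
# Hypothetical dry-run helper: prints the mpirun command line instead of
# executing it. Pass a host list as $1 to add -H; pass nothing to omit -H.
build_mpirun_cmd() {
    hosts="${1:-}"

    # NP and BIN come from the environment, as in the original command;
    # the fallbacks here are placeholders, not values from the thread.
    cmd="/usr/local/mpi/bin/mpirun -np ${NP:-1}"

    # Only add -H when a host list was actually given.
    if [ -n "$hosts" ]; then
        cmd="$cmd -H $hosts"
    fi

    cmd="$cmd --allow-run-as-root -bind-to none --map-by slot python3 ${BIN:-app.py}"
    printf '%s\n' "$cmd"
}

build_mpirun_cmd            # variant 1: no -H option at all
build_mpirun_cmd localhost  # variant 2: -H pointing at the container itself
```

Either printed line can be pasted in as the command for "docker run ..." to check whether the "orted: No such file or directory" message goes away when no remote launch is attempted.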