Jeff, there are several options.

First, if you want to use containers and you are not tied to Docker, Singularity is a better fit. If you have a resource manager that features a PMIx server, you can simply direct-run. For example, with SLURM:

    srun singularity exec container.sif a.out

I do not know much about Docker, but if it sets up its own network, that makes things tricky. One simple solution is to first spawn your containers with an SSH daemon running in them, and then do as before:

    docker run ... mpirun -H ... ...

With Open MPI 4, you also have the option to change the orted command line: you would use an orted wrapper that runs

    docker run /usr/local/mpi/bin/orted "$@"

and then launch with

    mpirun --mca orte_launch_agent /.../orted_wrapper.sh -H ...

Under the hood, Open MPI will run "ssh ... orted_wrapper.sh ..." instead of the usual "ssh ... orted ...".
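To make that concrete, here is a minimal sketch of such a wrapper; the image name "my_image" and the --network host flag are assumptions to adapt to your setup, not something Open MPI requires:

    #!/bin/bash
    # orted_wrapper.sh - sketch only: run the Open MPI daemon inside a container.
    # "my_image" is a placeholder image name; --network host is one (assumed) way
    # to sidestep Docker's private network so the orted can call back to mpirun.
    exec docker run --rm --network host my_image /usr/local/mpi/bin/orted "$@"

Make the wrapper executable and install it at the same path on every node before pointing --mca orte_launch_agent at it.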
Depending on how Docker handles the network, YMMV.

Hope this helps!

Gilles

On Tue, Oct 1, 2024 at 4:38 AM Jeffrey Layton <layto...@gmail.com> wrote:

> Gilles,
>
> This was exactly it - thank you.
>
> If I wanted to run the code in the container across multiple nodes, I
> would need to do something like "mpirun ... 'docker run ...' "?
>
> Thanks!
>
> Jeff
>
> On Mon, Sep 30, 2024 at 2:38 AM Gilles Gouaillardet via users <
> users@lists.open-mpi.org> wrote:
>
>> Jeffrey,
>>
>> You are invoking mpirun with the -H <hostfile> option, so basically
>> mpirun inside your container will run
>>
>>     ssh ... orted ...
>>
>> but the remote orted will not run in a container, hence the error
>> message.
>>
>> Note it is also possible you planned to run everything in the container,
>> but for some reason Open MPI failed to figure out that the name in the
>> host file is the container; in that case, try without the -H option, or
>> try using localhost in the host file.
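>> As a sketch, if everything should stay inside the one container,
>> something like this (the slot count 4 is just an example) avoids any
>> remote ssh/orted launch entirely:
>>
>>     mpirun -np 4 -H localhost:4 ...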
>> Cheers,
>>
>> Gilles
>>
>> On Mon, Sep 30, 2024 at 1:34 AM Jeffrey Layton via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> Howard,
>>>
>>> I tried the first experiment of using orted instead of mpirun. The
>>> output is below.
>>>
>>> /usr/local/mpi/bin/orted: Error: unknown option "-np"
>>> Type '/usr/local/mpi/bin/orted --help' for usage.
>>> Usage: /usr/local/mpi/bin/orted [OPTION]...
>>>    -d|--debug               Debug the OpenRTE
>>>    --daemonize              Daemonize the orted into the background
>>>    --debug-daemons          Enable debugging of OpenRTE daemons
>>>    --debug-daemons-file     Enable debugging of OpenRTE daemons, storing
>>>                             output in files
>>>    -h|--help                This help message
>>>    --hnp                    Direct the orted to act as the HNP
>>>    --hnp-uri <arg0>         URI for the HNP
>>>    -nodes|--nodes <arg0>    Regular expression defining nodes in system
>>>    -output-filename|--output-filename <arg0>
>>>                             Redirect output from application processes
>>>                             into filename.rank
>>>    --parent-uri <arg0>      URI for the parent if tree launch is enabled.
>>>    -report-bindings|--report-bindings
>>>                             Whether to report process bindings to stderr
>>>    --report-uri <arg0>      Report this process' uri on indicated pipe
>>>    -s|--spin                Have the orted spin until we can connect a
>>>                             debugger to it
>>>    --set-sid                Direct the orted to separate from the current
>>>                             session
>>>    --singleton-died-pipe <arg0>
>>>                             Watch on indicated pipe for singleton
>>>                             termination
>>>    --test-suicide <arg0>    Suicide instead of clean abort after delay
>>>    --tmpdir <arg0>          Set the root for the session directory tree
>>>    -tree-spawn|--tree-spawn Tree-based spawn in progress
>>>    -xterm|--xterm <arg0>    Create a new xterm window and display output
>>>                             from the specified ranks there
>>>
>>> For additional mpirun arguments, run 'mpirun --help <category>'
>>>
>>> The following categories exist: general (Defaults to this option),
>>> debug, output, input, mapping, ranking, binding, devel (arguments useful
>>> to OMPI Developers), compatibility (arguments supported for backwards
>>> compatibility), launch (arguments to modify launch options), and dvm
>>> (Distributed Virtual Machine arguments).
>>>
>>> Then I tried adding the debug flag you mentioned and I got the same
>>> error:
>>>
>>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>>> --------------------------------------------------------------------------
>>> ORTE was unable to reliably start one or more daemons.
>>>
>>> I also tried a third experiment with a container I have used before. It
>>> has an older version of Open MPI, but I get the same answer as I get
>>> now:
>>>
>>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>>> --------------------------------------------------------------------------
>>> ORTE was unable to reliably start one or more daemons.
>>>
>>> This is sounding like a path problem, but I'm not sure. Adding the path
>>> to MPI in $PATH and $LD_LIBRARY_PATH didn't change the error message.
>>>
>>> Thanks!
>>>
>>> Jeff
>>>
>>> ------------------------------
>>> From: users <users-boun...@lists.open-mpi.org> on behalf of Pritchard
>>> Jr., Howard via users <users@lists.open-mpi.org>
>>> Sent: Friday, September 27, 2024 4:40 PM
>>> To: Open MPI Users <users@lists.open-mpi.org>
>>> Cc: Pritchard Jr., Howard (EXTERNAL) <howa...@lanl.gov>
>>> Subject: Re: [OMPI users] [EXTERNAL] Issue with mpirun inside a container
>>>
>>> Hello Jeff,
>>>
>>> As an experiment, why not try
>>>
>>>     docker run /usr/local/mpi/bin/orted
>>>
>>> and report the results?
>>>
>>> Also, you may want to add --debug-daemons to the mpirun command line as
>>> another experiment.
>>>
>>> Howard
>>>
>>> From: users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey
>>> Layton via users <users@lists.open-mpi.org>
>>> Reply-To: Open MPI Users <users@lists.open-mpi.org>
>>> Date: Friday, September 27, 2024 at 1:08 PM
>>> To: Open MPI Users <users@lists.open-mpi.org>
>>> Cc: Jeffrey Layton <layto...@gmail.com>
>>> Subject: [EXTERNAL] [OMPI users] Issue with mpirun inside a container
>>>
>>> Good afternoon,
>>>
>>> I'm getting an error message when I run "mpirun ..." inside a Docker
>>> container.
>>> The message:
>>>
>>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>>> --------------------------------------------------------------------------
>>> ORTE was unable to reliably start one or more daemons.
>>> This usually is caused by:
>>>
>>> * not finding the required libraries and/or binaries on
>>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>
>>> * lack of authority to execute on one or more specified nodes.
>>>   Please verify your allocation and authorities.
>>>
>>> * the inability to write startup files into /tmp
>>>   (--tmpdir/orte_tmpdir_base).
>>>   Please check with your sys admin to determine the correct location to
>>>   use.
>>>
>>> * compilation of the orted with dynamic libraries when static are
>>>   required (e.g., on Cray). Please check your configure cmd line and
>>>   consider using one of the contrib/platform definitions for your
>>>   system type.
>>>
>>> * an inability to create a connection back to mpirun due to a
>>>   lack of common network interfaces and/or no route found between
>>>   them. Please check network connectivity (including firewalls
>>>   and network routing requirements).
>>> --------------------------------------------------------------------------
>>>
>>> From googling, I know this is a fairly common error message. BTW -
>>> great error message with good suggestions.
>>>
>>> The actual mpirun command is:
>>>
>>>     /usr/local/mpi/bin/mpirun -np $NP -H $JJ --allow-run-as-root \
>>>         -bind-to none --map-by slot \
>>>         python3 $BIN --checkpoint=$CHECKPOINT --model=$MODEL \
>>>         --nepochs=$NEPOCHS --fsdir=$FSDIR
>>>
>>> This is called as the command for a "docker run ..." command.
>>>
>>> I've tried a couple of things, such as making a compound command for
>>> "docker run ..." that sets $PATH and $LD_LIBRARY_PATH, and I get the
>>> same message. BTW - orted is located exactly where the error message
>>> indicates. I've also tried not using the fully qualified path for
>>> mpirun and just using "mpirun"; I get the same error message.
>>>
>>> I can run this "by hand" after starting the Docker container: I run the
>>> container with "docker run ..." but without the mpirun command, and
>>> then I run a simple script that defines the env variables and ends with
>>> the mpirun command; this works correctly. But using Slurm, or using ssh
>>> directly to a node, causes the above error message.
>>>
>>> BTW - someone else built this container with Open MPI and I can't
>>> really change it (I thought about rebuilding Open MPI in the container,
>>> but I don't know the details of how it was built).
>>>
>>> Any thoughts?
>>>
>>> Thanks!
>>>
>>> Jeff
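>>> P.S. For concreteness, the whole launch is roughly of this shape; the
>>> image name "the_image" is a placeholder (the real value comes from our
>>> wrapper scripts), and the variables are set before the command runs:
>>>
>>>     docker run --rm the_image /usr/local/mpi/bin/mpirun -np $NP -H $JJ \
>>>         --allow-run-as-root -bind-to none --map-by slot \
>>>         python3 $BIN --checkpoint=$CHECKPOINT --model=$MODEL \
>>>         --nepochs=$NEPOCHS --fsdir=$FSDIR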