Jeff,

there are several options...
First, if you want to use containers and you are not tied to Docker,
Singularity is a better fit.
If you have a resource manager that features a PMIx server, you can
simply direct-launch.

For example with SLURM:
srun singularity exec container.sif a.out

I do not know much about Docker, but if it sets up its own network, that
makes things tricky.
One simple solution is to first spawn your containers and run an SSH daemon
in them.
Then do as before:
docker run mpirun -H ... ...
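
As a sketch of that approach (the image name my-mpi-image, host networking,
and the sshd path are assumptions about your image):

```shell
# Hypothetical setup: start one long-lived container per node with an SSH
# daemon in the foreground, so mpirun's ssh launches land inside a container.
# "my-mpi-image" is an assumed image name; the image must include sshd and
# accept key-based logins.
docker run -d --rm --network host --name mpi-node my-mpi-image /usr/sbin/sshd -D

# Then, from inside one of the containers, launch as usual:
#   mpirun -H node1,node2 -np 2 ./a.out
```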

With Open MPI 4, you have the option to change the orted command line.
You would simply use an orted wrapper like this:

docker run /usr/local/mpi/bin/orted "$@"

and then

mpirun --mca orte_launch_agent /.../orted_wrapper.sh -H ...

Under the hood, Open MPI will
ssh ... orted_wrapper.sh ...
instead of the usual
ssh ... orted ...
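
For concreteness, a minimal sketch of such a wrapper (the script path
/tmp/orted_wrapper.sh, the image name my-mpi-image, and host networking are
assumptions; adapt them to your setup):

```shell
# Sketch of an orted wrapper; "my-mpi-image" and the script location are
# assumptions. The script must exist at the same path on every node.
cat > /tmp/orted_wrapper.sh <<'EOF'
#!/bin/sh
# Forward every argument mpirun passes over ssh to orted inside the container.
exec docker run --rm --network host my-mpi-image \
    /usr/local/mpi/bin/orted "$@"
EOF
chmod +x /tmp/orted_wrapper.sh
```

Host networking is used here so the remote orted can open a connection back
to mpirun; with a different Docker network setup you may need other options.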

Depending on how docker handles the network, YMMV.


Hope this helps!

Gilles

On Tue, Oct 1, 2024 at 4:38 AM Jeffrey Layton <layto...@gmail.com> wrote:

> Gilles,
>
> This was exactly it - thank you.
>
> If I wanted to run the code in the container across multiple nodes, I
> would need to do something like "mpirun ... 'docker run ...' "?
>
> Thanks!
>
> Jeff
>
>
> On Mon, Sep 30, 2024 at 2:38 AM Gilles Gouaillardet via users <
> users@lists.open-mpi.org> wrote:
>
>> Jeffrey,
>>
>> You are invoking mpirun with the -H <hostfile> option, so basically
>> mpirun inside your container will
>> ssh ... orted ...
>> but the remote orted will not run in a container, and hence the error
>> message.
>> Note it is possible you planned to run everything in the container, but
>> for some reason Open MPI failed to figure out that the name in the host
>> file refers to the container; in that case, try without the -H option, or
>> try using localhost in the host file.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Mon, Sep 30, 2024 at 1:34 AM Jeffrey Layton via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> Howard,
>>>
>>> I tried the first experiment of using orted instead of mpirun. The
>>> output is below.
>>>
>>>
>>> /usr/local/mpi/bin/orted: Error: unknown option "-np"
>>> Type '/usr/local/mpi/bin/orted --help' for usage.
>>> Usage: /usr/local/mpi/bin/orted [OPTION]...
>>> -d|--debug               Debug the OpenRTE
>>>    --daemonize           Daemonize the orted into the background
>>>    --debug-daemons       Enable debugging of OpenRTE daemons
>>>    --debug-daemons-file  Enable debugging of OpenRTE daemons, storing
>>> output
>>>                          in files
>>> -h|--help                This help message
>>>    --hnp                 Direct the orted to act as the HNP
>>>    --hnp-uri <arg0>      URI for the HNP
>>>    -nodes|--nodes <arg0>
>>>                          Regular expression defining nodes in system
>>>    -output-filename|--output-filename <arg0>
>>>                          Redirect output from application processes into
>>>                          filename.rank
>>>    --parent-uri <arg0>   URI for the parent if tree launch is enabled.
>>>    -report-bindings|--report-bindings
>>>                          Whether to report process bindings to stderr
>>>    --report-uri <arg0>   Report this process' uri on indicated pipe
>>> -s|--spin                Have the orted spin until we can connect a
>>> debugger
>>>                          to it
>>>    --set-sid             Direct the orted to separate from the current
>>>                          session
>>>    --singleton-died-pipe <arg0>
>>>                          Watch on indicated pipe for singleton
>>> termination
>>>    --test-suicide <arg0>
>>>                          Suicide instead of clean abort after delay
>>>    --tmpdir <arg0>       Set the root for the session directory tree
>>>    -tree-spawn|--tree-spawn
>>>                          Tree-based spawn in progress
>>>    -xterm|--xterm <arg0>
>>>                          Create a new xterm window and display output
>>> from
>>>                          the specified ranks there
>>>
>>> For additional mpirun arguments, run 'mpirun --help <category>'
>>>
>>> The following categories exist: general (Defaults to this option), debug,
>>>     output, input, mapping, ranking, binding, devel (arguments useful to
>>> OMPI
>>>     Developers), compatibility (arguments supported for backwards
>>> compatibility),
>>>     launch (arguments to modify launch options), and dvm (Distributed
>>> Virtual
>>>     Machine arguments).
>>>
>>>
>>>
>>> Then I tried adding the debug flag you mentioned and I got the same
>>> error:
>>>
>>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>>>
>>> --------------------------------------------------------------------------
>>> ORTE was unable to reliably start one or more daemons.
>>>
>>>
>>> I also tried a third experiment using a container I have used before. It
>>> has an older version of Open MPI, but I get the same answer as I get now:
>>>
>>>
>>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>>>
>>> --------------------------------------------------------------------------
>>> ORTE was unable to reliably start one or more daemons.
>>>
>>>
>>> This is sounding like a path problem but I'm not sure. Adding the path
>>> to MPI in $PATH and $LD_LIBRARY_PATH didn't change the error message.
>>>
>>> Thanks!
>>>
>>> Jeff
>>>
>>>
>>> ------------------------------
>>> *From:* users <users-boun...@lists.open-mpi.org> on behalf of Pritchard
>>> Jr., Howard via users <users@lists.open-mpi.org>
>>> *Sent:* Friday, September 27, 2024 4:40 PM
>>> *To:* Open MPI Users <users@lists.open-mpi.org>
>>> *Cc:* Pritchard Jr., Howard (EXTERNAL) <howa...@lanl.gov>
>>> *Subject:* Re: [OMPI users] [EXTERNAL] Issue with mpirun inside a
>>> container
>>>
>>> *External email: Use caution opening links or attachments*
>>>
>>> Hello Jeff,
>>>
>>>
>>>
>>> As an experiment why not try
>>>
>>>
>>>
>>> docker run  /usr/local/mpi/bin/orted
>>>
>>>
>>>
>>> ?
>>>
>>>
>>>
>>> and report the results?
>>>
>>>
>>>
>>> Also, you may want to add --debug-daemons to the mpirun command line as
>>> another experiment.
>>>
>>>
>>>
>>> Howard
>>>
>>>
>>>
>>> *From: *users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey
>>> Layton via users <users@lists.open-mpi.org>
>>> *Reply-To: *Open MPI Users <users@lists.open-mpi.org>
>>> *Date: *Friday, September 27, 2024 at 1:08 PM
>>> *To: *Open MPI Users <users@lists.open-mpi.org>
>>> *Cc: *Jeffrey Layton <layto...@gmail.com>
>>> *Subject: *[EXTERNAL] [OMPI users] Issue with mpirun inside a container
>>>
>>>
>>>
>>> Good afternoon,
>>>
>>>
>>>
>>> I'm getting an error message when I run "mpirun ... " inside a Docker
>>> container. The message:
>>>
>>>
>>>
>>>
>>>
>>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>>>
>>> --------------------------------------------------------------------------
>>> ORTE was unable to reliably start one or more daemons.
>>> This usually is caused by:
>>>
>>> * not finding the required libraries and/or binaries on
>>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>
>>> * lack of authority to execute on one or more specified nodes.
>>>   Please verify your allocation and authorities.
>>>
>>> * the inability to write startup files into /tmp
>>> (--tmpdir/orte_tmpdir_base).
>>>   Please check with your sys admin to determine the correct location to
>>> use.
>>>
>>> *  compilation of the orted with dynamic libraries when static are
>>> required
>>>   (e.g., on Cray). Please check your configure cmd line and consider
>>> using
>>>   one of the contrib/platform definitions for your system type.
>>>
>>> * an inability to create a connection back to mpirun due to a
>>>   lack of common network interfaces and/or no route found between
>>>   them. Please check network connectivity (including firewalls
>>>   and network routing requirements).
>>>
>>> --------------------------------------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>> In googling I know this is a fairly common error message. BTW - great
>>> error message with good suggestions.
>>>
>>>
>>>
>>>
>>>
>>> The actual mpirun command is:
>>>
>>>
>>>
>>> /usr/local/mpi/bin/mpirun -np $NP -H $JJ --allow-run-as-root -bind-to
>>> none --map-by slot \
>>>     python3 $BIN --checkpoint=$CHECKPOINT --model=$MODEL
>>> --nepochs=$NEPOCHS \
>>>     --fsdir=$FSDIR
>>>
>>>
>>>
>>>
>>>
>>> This is called as the command for a "docker run ..." command.
>>>
>>>
>>>
>>> I've tried a couple of things, such as making a multi-step command for
>>> "docker run ..." that sets $PATH and $LD_LIBRARY_PATH, and I get the same
>>> message. BTW - orted is located exactly where the error message indicated.
>>> I've also tried not using the full path for mpirun and just using "mpirun".
>>> I get the same error message.
>>>
>>>
>>>
>>> I can run this "by hand" after starting the Docker container. I just run
>>> the container "docker run ..." but without the mpirun command, and then I
>>> run a simple script that defines the env variables and ends with the mpirun
>>> command; this works correctly. But using Slurm or using ssh directly to a
>>> node causes the above error message.
>>>
>>>
>>>
>>> BTW - someone else built this container with Open MPI and I can't really
>>> change it (I thought about rebuilding Open MPI in the container but I don't
>>> know the details of how it was built).
>>>
>>>
>>>
>>> Any thoughts?
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>> Jeff
>>>
>>>
>>>
>>