Gilles,

This was exactly it - thank you.

If I wanted to run the code in the container across multiple nodes, would I
need to do something like "mpirun ... 'docker run ...'"?
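
That is, something of this shape, where the hostnames, slot counts, image
name, and script are placeholders just to show the pattern I have in mind:

# hostnames, slot counts, image, and script are placeholders
mpirun -np 4 -H node01:2,node02:2 \
    docker run --rm --network=host my-mpi-image \
    python3 train.py

(I realize the containers would still need to be able to reach each other
over the network for this to work; I'm only asking about the general
pattern.)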

Thanks!

Jeff


On Mon, Sep 30, 2024 at 2:38 AM Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Jeffrey,
>
> You are invoking mpirun with the -H <hostfile> option, so mpirun inside
> your container will basically do
>
> ssh ... orted ...
>
> but the remote orted will not run in a container, hence the error message.
> Note it is also possible you planned to run everything in the container,
> but for some reason Open MPI failed to figure out that the name in the
> host file is the container itself. In that case, try without the -H
> option, or use localhost in the host file.
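>
> For example, a minimal host file along these lines (the slot count below
> is only illustrative):
>
> # host file kept inside the container; slot count is illustrative
> localhost slots=4
>
> should keep the whole launch inside the container.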
>
> Cheers,
>
> Gilles
>
> On Mon, Sep 30, 2024 at 1:34 AM Jeffrey Layton via users <
> users@lists.open-mpi.org> wrote:
>
>> Howard,
>>
>> I tried the first experiment of using orted instead of mpirun. The output
>> is below.
>>
>>
>> /usr/local/mpi/bin/orted: Error: unknown option "-np"
>> Type '/usr/local/mpi/bin/orted --help' for usage.
>> Usage: /usr/local/mpi/bin/orted [OPTION]...
>> -d|--debug               Debug the OpenRTE
>>    --daemonize           Daemonize the orted into the background
>>    --debug-daemons       Enable debugging of OpenRTE daemons
>>    --debug-daemons-file  Enable debugging of OpenRTE daemons, storing
>>                          output in files
>> -h|--help                This help message
>>    --hnp                 Direct the orted to act as the HNP
>>    --hnp-uri <arg0>      URI for the HNP
>>    -nodes|--nodes <arg0>
>>                          Regular expression defining nodes in system
>>    -output-filename|--output-filename <arg0>
>>                          Redirect output from application processes into
>>                          filename.rank
>>    --parent-uri <arg0>   URI for the parent if tree launch is enabled.
>>    -report-bindings|--report-bindings
>>                          Whether to report process bindings to stderr
>>    --report-uri <arg0>   Report this process' uri on indicated pipe
>> -s|--spin                Have the orted spin until we can connect a
>>                          debugger to it
>>    --set-sid             Direct the orted to separate from the current
>>                          session
>>    --singleton-died-pipe <arg0>
>>                          Watch on indicated pipe for singleton termination
>>    --test-suicide <arg0>
>>                          Suicide instead of clean abort after delay
>>    --tmpdir <arg0>       Set the root for the session directory tree
>>    -tree-spawn|--tree-spawn
>>                          Tree-based spawn in progress
>>    -xterm|--xterm <arg0>
>>                          Create a new xterm window and display output from
>>                          the specified ranks there
>>
>> For additional mpirun arguments, run 'mpirun --help <category>'
>>
>> The following categories exist: general (Defaults to this option), debug,
>>     output, input, mapping, ranking, binding, devel (arguments useful to
>>     OMPI Developers), compatibility (arguments supported for backwards
>>     compatibility), launch (arguments to modify launch options), and dvm
>>     (Distributed Virtual Machine arguments).
>>
>>
>>
>> Then I tried adding the debug flag you mentioned, and I got the same
>> error:
>>
>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>>
>>
>> For a third experiment, I tried using a container I have used before. It
>> has an older version of Open MPI, but I get the same error as I do now:
>>
>>
>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>>
>>
>> This sounds like a path problem, but I'm not sure. Adding the MPI path to
>> $PATH and $LD_LIBRARY_PATH didn't change the error message.
>>
>> Thanks!
>>
>> Jeff
>>
>>
>> ------------------------------
>> *From:* users <users-boun...@lists.open-mpi.org> on behalf of Pritchard
>> Jr., Howard via users <users@lists.open-mpi.org>
>> *Sent:* Friday, September 27, 2024 4:40 PM
>> *To:* Open MPI Users <users@lists.open-mpi.org>
>> *Cc:* Pritchard Jr., Howard (EXTERNAL) <howa...@lanl.gov>
>> *Subject:* Re: [OMPI users] [EXTERNAL] Issue with mpirun inside a
>> container
>>
>> Hello Jeff,
>>
>> As an experiment, why not try
>>
>> docker run /usr/local/mpi/bin/orted
>>
>> and report the results?
>>
>> Also, you may want to add --debug-daemons to the mpirun command line as
>> another experiment.
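>>
>> That is, something along these lines, leaving the rest of your existing
>> arguments unchanged:
>>
>> /usr/local/mpi/bin/mpirun --debug-daemons -np $NP -H $JJ --allow-run-as-root ...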
>>
>>
>>
>> Howard
>>
>>
>>
>> *From: *users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey
>> Layton via users <users@lists.open-mpi.org>
>> *Reply-To: *Open MPI Users <users@lists.open-mpi.org>
>> *Date: *Friday, September 27, 2024 at 1:08 PM
>> *To: *Open MPI Users <users@lists.open-mpi.org>
>> *Cc: *Jeffrey Layton <layto...@gmail.com>
>> *Subject: *[EXTERNAL] [OMPI users] Issue with mpirun inside a container
>>
>>
>>
>> Good afternoon,
>>
>>
>>
>> I'm getting an error message when I run "mpirun ... " inside a Docker
>> container. The message:
>>
>>
>>
>>
>>
>> bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>>
>> * not finding the required libraries and/or binaries on
>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>
>> * lack of authority to execute on one or more specified nodes.
>>   Please verify your allocation and authorities.
>>
>> * the inability to write startup files into /tmp
>>   (--tmpdir/orte_tmpdir_base). Please check with your sys admin to
>>   determine the correct location to use.
>>
>> * compilation of the orted with dynamic libraries when static are required
>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>   one of the contrib/platform definitions for your system type.
>>
>> * an inability to create a connection back to mpirun due to a
>>   lack of common network interfaces and/or no route found between
>>   them. Please check network connectivity (including firewalls
>>   and network routing requirements).
>> --------------------------------------------------------------------------
>>
>>
>>
>>
>>
>> From googling, I know this is a fairly common error message. BTW, it's a
>> great error message with good suggestions.
>>
>>
>>
>>
>>
>> The actual mpirun command is:
>>
>>
>>
>> /usr/local/mpi/bin/mpirun -np $NP -H $JJ --allow-run-as-root \
>>     -bind-to none --map-by slot \
>>     python3 $BIN --checkpoint=$CHECKPOINT --model=$MODEL \
>>     --nepochs=$NEPOCHS --fsdir=$FSDIR
>>
>>
>>
>>
>>
>> This is used as the command for a "docker run ..." invocation.
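>>
>> Roughly of this shape (the image name and docker flags shown here are
>> placeholders, not the real ones):
>>
>> # image name and docker run flags are placeholders
>> docker run --rm --network=host my-mpi-image \
>>     /usr/local/mpi/bin/mpirun -np $NP -H $JJ --allow-run-as-root \
>>     -bind-to none --map-by slot python3 $BIN ...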
>>
>>
>>
>> I've tried a couple of things, such as building a compound command for
>> "docker run ..." that sets $PATH and $LD_LIBRARY_PATH, and I get the same
>> message. BTW, orted is located exactly where the error message indicated.
>> I've also tried dropping the full path for mpirun and just using "mpirun";
>> I get the same error message.
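>>
>> For reference, the compound "docker run ..." attempt was along these lines
>> (the image name is a placeholder, and I'm assuming the MPI libraries live
>> under /usr/local/mpi/lib):
>>
>> docker run --rm my-mpi-image bash -c '
>>     export PATH=/usr/local/mpi/bin:$PATH
>>     export LD_LIBRARY_PATH=/usr/local/mpi/lib:$LD_LIBRARY_PATH  # assumed lib dir
>>     mpirun ...'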
>>
>>
>>
>> I can run this "by hand" after starting the Docker container: I start the
>> container with "docker run ..." but without the mpirun command, and then I
>> run a simple script that defines the env variables and ends with the mpirun
>> command; this works correctly. But launching it through Slurm, or via ssh
>> directly to a node, produces the error message above.
>>
>>
>>
>> BTW - someone else built this container with Open MPI and I can't really
>> change it (I thought about rebuilding Open MPI in the container but I don't
>> know the details of how it was built).
>>
>>
>>
>> Any thoughts?
>>
>>
>>
>> Thanks!
>>
>>
>>
>> Jeff
>>
>>
>>
>
