Howard,

I tried the first experiment of using orted instead of mpirun. The output is 
below.


/usr/local/mpi/bin/orted: Error: unknown option "-np"
Type '/usr/local/mpi/bin/orted --help' for usage.
Usage: /usr/local/mpi/bin/orted [OPTION]...
-d|--debug               Debug the OpenRTE
   --daemonize           Daemonize the orted into the background
   --debug-daemons       Enable debugging of OpenRTE daemons
   --debug-daemons-file  Enable debugging of OpenRTE daemons, storing output
                         in files
-h|--help                This help message
   --hnp                 Direct the orted to act as the HNP
   --hnp-uri <arg0>      URI for the HNP
   -nodes|--nodes <arg0>
                         Regular expression defining nodes in system
   -output-filename|--output-filename <arg0>
                         Redirect output from application processes into
                         filename.rank
   --parent-uri <arg0>   URI for the parent if tree launch is enabled.
   -report-bindings|--report-bindings
                         Whether to report process bindings to stderr
   --report-uri <arg0>   Report this process' uri on indicated pipe
-s|--spin                Have the orted spin until we can connect a debugger
                         to it
   --set-sid             Direct the orted to separate from the current
                         session
   --singleton-died-pipe <arg0>
                         Watch on indicated pipe for singleton termination
   --test-suicide <arg0>
                         Suicide instead of clean abort after delay
   --tmpdir <arg0>       Set the root for the session directory tree
   -tree-spawn|--tree-spawn
                         Tree-based spawn in progress
   -xterm|--xterm <arg0>
                         Create a new xterm window and display output from
                         the specified ranks there

For additional mpirun arguments, run 'mpirun --help <category>'

The following categories exist: general (Defaults to this option), debug,
    output, input, mapping, ranking, binding, devel (arguments useful to OMPI
    Developers), compatibility (arguments supported for backwards 
compatibility),
    launch (arguments to modify launch options), and dvm (Distributed Virtual
    Machine arguments).



Then I tried adding the debug flag you mentioned and I got the same error.

bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.


I also tried a third experiment, using a container I have used before. It has
an older version of Open MPI, but I get the same result as I do now:


bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.


This sounds like a path problem, but I'm not sure. Adding the MPI path to
$PATH and $LD_LIBRARY_PATH didn't change the error message.
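
One more thing worth trying, since the original error text mentions
--enable-orterun-prefix-by-default: mpirun's --prefix option, which tells the
launcher to set PATH and LD_LIBRARY_PATH on the remote side before starting
orted. A sketch (node names and process count are placeholders, not my real
setup):

```shell
# Sketch: --prefix makes mpirun export the install prefix to remote nodes
# before launching orted (node1,node2 and -np 2 are placeholders).
/usr/local/mpi/bin/mpirun --prefix /usr/local/mpi \
    -np 2 -H node1,node2 hostname
```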

Thanks!

Jeff


________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Pritchard Jr., 
Howard via users <users@lists.open-mpi.org>
Sent: Friday, September 27, 2024 4:40 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Pritchard Jr., Howard (EXTERNAL) <howa...@lanl.gov>
Subject: Re: [OMPI users] [EXTERNAL] Issue with mpirun inside a container


Hello Jeff,



As an experiment, why not try

docker run  /usr/local/mpi/bin/orted

and report the results?



Also, you may want to add --debug-daemons to the mpirun command line as another
experiment.
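
For reference, a minimal sketch of such an invocation (host names and process
count here are placeholders):

```shell
# Sketch: --debug-daemons keeps the orted daemons in the foreground and
# prints their startup/failure messages to stderr (hosts are placeholders).
mpirun --debug-daemons -np 2 -H node1,node2 hostname
```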



Howard



From: users <users-boun...@lists.open-mpi.org> on behalf of Jeffrey Layton via 
users <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Friday, September 27, 2024 at 1:08 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Jeffrey Layton <layto...@gmail.com>
Subject: [EXTERNAL] [OMPI users] Issue with mpirun inside a container



Good afternoon,



I'm getting an error message when I run "mpirun ... " inside a Docker 
container. The message:

bash: line 1: /usr/local/mpi/bin/orted: No such file or directory
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

From googling, I know this is a fairly common error message. BTW - great error
message with good suggestions.

The actual mpirun command is:



/usr/local/mpi/bin/mpirun -np $NP -H $JJ --allow-run-as-root -bind-to none 
--map-by slot \
    python3 $BIN --checkpoint=$CHECKPOINT --model=$MODEL --nepochs=$NEPOCHS \
    --fsdir=$FSDIR

This is called as the command for a "docker run ..." command.



I've tried a couple of things, such as building a compound command for "docker
run ..." that sets $PATH and $LD_LIBRARY_PATH, and I get the same message.
BTW - orted is located exactly where the error message indicates. I've also
tried dropping the fully qualified path for mpirun and just using "mpirun";
I get the same error message.
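
One way to set those variables for the container, e.g. something along these
lines (the image name is a placeholder), is to pass them explicitly with
"docker run -e" rather than inside the command string:

```shell
# Sketch: pass PATH/LD_LIBRARY_PATH into the container explicitly
# (my-mpi-image is a placeholder for the actual image name).
docker run --rm \
    -e PATH=/usr/local/mpi/bin:/usr/local/bin:/usr/bin:/bin \
    -e LD_LIBRARY_PATH=/usr/local/mpi/lib \
    my-mpi-image /usr/local/mpi/bin/mpirun --version
```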



I can run this "by hand" after starting the Docker container: I run the
container ("docker run ...") without the mpirun command, then run a simple
script that defines the env variables and ends with the mpirun command; this
works correctly. But launching via Slurm, or via ssh directly to a node,
produces the error message above.
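
One difference I can think of is the environment a non-interactive ssh shell
sees, compared to an interactive login. Something like this would show it
(node01 is a placeholder for an actual node name):

```shell
# Sketch: a non-interactive ssh shell may see a different PATH than an
# interactive login shell does (node01 is a placeholder).
ssh node01 'echo $PATH; which mpirun'
```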



BTW - someone else built this container with Open MPI, and I can't really
change it (I thought about rebuilding Open MPI in the container, but I don't
know the details of how it was built).



Any thoughts?



Thanks!



Jeff

