Adam C Powell IV wrote:
As mentioned, I'm running in a chroot environment, so rsh and ssh won't
work: "rsh localhost" will rsh into the primary local host environment,
not the chroot, so it will fail.
[The purpose is to be able to build and test MPI programs in the Debian
unstable distribution, without upgrading the whole machine to unstable.
Though most machines I use for this purpose run Debian stable or
testing, the machine I'm currently using runs a very old Fedora, for
which I don't think OpenMPI is available.]
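[Roughly speaking, what I'd like to be able to do inside the chroot is
something like the following (hello.c here is just a placeholder for some
trivial MPI test program, not anything specific):

$ mpicc hello.c -o hello
$ mpirun -np 1 ./hello

i.e. compile against the packaged Open MPI and run a single-process test,
all without leaving the chroot.]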
All right, I understand what you are trying to do now. To be honest, I
don't think we have ever really thought about this use case. We always
figured that to test Open MPI, people would simply install it in a
different directory and use it from there.
With MPICH, mpirun -np 1 just runs the new process in the current
context, without rsh/ssh, so it works in a chroot. Does OpenMPI not
support this functionality?
Open MPI does support this functionality. First, a bit of explanation:
we use 'pls' (process launching system) components to handle the
launching of processes. There are components for slurm, gridengine, rsh,
and others. At runtime we open each of these components and query them
as to whether they can be used. The original error you posted says that
none of the 'pls' components can be used, because they all detected that
they could not run in your setup. The slurm one excluded itself because
there were no environment variables set indicating it was running under
SLURM. Similarly, the gridengine pls said it could not run. The
'rsh' pls said it could not run because neither 'ssh' nor 'rsh' is
available (I assume this is the case, though you did not explicitly say
they were not available).
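As a side note, if you want to see which pls components your Open MPI
build actually includes, something like this should list them:

$ ompi_info | grep pls

(the exact output depends on how your copy of Open MPI was configured,
but you should see one line per available pls component).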
But in this case, you do want the 'rsh' pls to be used. It will
automatically fork any local processes, and will use rsh/ssh to launch
any remote processes. Again, I don't think we ever imagined the use case
of a UNIX-like system where no launcher like SLURM is available and
rsh/ssh is not available either (Open MPI is, after all, primarily
concerned with multi-node operation).
So, there are several ways around this:
1. Make rsh or ssh available, even though they will not be used.
2. Tell the 'rsh' pls component to use a dummy program such as
/bin/false by adding the following to the command line:
-mca pls_rsh_agent /bin/false
3. Create a dummy 'rsh' executable that is available in your path.
For instance:
[tprins@odin ~]$ which ssh
/usr/bin/which: no ssh in
(/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)
[tprins@odin ~]$ which rsh
/usr/bin/which: no rsh in
(/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)
[tprins@odin ~]$ mpirun -np 1 hostname
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
runtime/orte_init_stage1.c at line 317
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_pls_base_select failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
runtime/orte_system_init.c at line 46
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 52
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
orterun.c at line 399
[tprins@odin ~]$ mpirun -np 1 -mca pls_rsh_agent /bin/false hostname
odin.cs.indiana.edu
[tprins@odin ~]$ touch usr/bin/rsh
[tprins@odin ~]$ chmod +x usr/bin/rsh
[tprins@odin ~]$ mpirun -np 1 hostname
odin.cs.indiana.edu
[tprins@odin ~]$
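One more note: if modifying the mpirun command line is inconvenient, the
same MCA parameter should also be settable through the environment, e.g.
(assuming a Bourne-style shell):

$ export OMPI_MCA_pls_rsh_agent=/bin/false

or through a line like 'pls_rsh_agent = /bin/false' in
$HOME/.openmpi/mca-params.conf. I have not tried these in a chroot
specifically, but they are the standard ways of setting MCA parameters.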
I hope this helps,
Tim
Thanks,
Adam
On Wed, 2007-07-18 at 11:09 -0400, Tim Prins wrote:
This is strange. I assume that you want to use rsh or ssh to launch the
processes?
If you want to use ssh, does "which ssh" find ssh? Similarly, if you
want to use rsh, does "which rsh" find rsh?
Thanks,
Tim
Adam C Powell IV wrote:
On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:
Adam C Powell IV wrote:
Greetings,
I'm running the Debian package of OpenMPI in a chroot (with /proc
mounted properly), and orte_init is failing as follows:
[snip]
What could be wrong? Does orterun not run in a chroot environment?
What more can I do to investigate further?
Try running mpirun with the added options:
-mca orte_debug 1 -mca pls_base_verbose 20
Then send the output to the list.
Thanks! Here's the output:
$ orterun -mca orte_debug 1 -mca pls_base_verbose 20 -np 1 uptime
[new-host-3:19201] mca: base: components_open: Looking for pls components
[new-host-3:19201] mca: base: components_open: distilling pls components
[new-host-3:19201] mca: base: components_open: accepting all pls components
[new-host-3:19201] mca: base: components_open: opening pls components
[new-host-3:19201] mca: base: components_open: found loaded component gridengine
[new-host-3:19201] mca: base: components_open: component gridengine
open function successful
[new-host-3:19201] mca: base: components_open: found loaded component proxy
[new-host-3:19201] mca: base: components_open: component proxy open function
successful
[new-host-3:19201] mca: base: components_open: found loaded component rsh
[new-host-3:19201] mca: base: components_open: component rsh open function
successful
[new-host-3:19201] mca: base: components_open: found loaded component slurm
[new-host-3:19201] mca: base: components_open: component slurm open function
successful
[new-host-3:19201] orte:base:select: querying component gridengine
[new-host-3:19201] pls:gridengine: NOT available for selection
[new-host-3:19201] orte:base:select: querying component proxy
[new-host-3:19201] orte:base:select: querying component rsh
[new-host-3:19201] orte:base:select: querying component slurm
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file
runtime/orte_init_stage1.c at line 312
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_pls_base_select failed
--> Returned value -1 instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file
runtime/orte_system_init.c at line 42
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at
line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
-Adam