Romaric David wrote:
Pak Lui wrote:
It was fixed at one point in the trunk before v1.3 went official, but
while rolling the gridengine PLM code into the rsh PLM, this feature
was left out because there were some lingering issues that I hadn't
resolved, and I lost track of it. Sorry about that, but thanks for
bringing it up; I will need to look at the issue again and reopen this
ticket against v1.3:
OK, so do I have to wait for a 1.3 version to get job suspend working,
or will it be back-ported to 1.2.6?
I believe it will definitely be in the 1.3 series; I am not sure about
v1.2 at this point.
So even though it is the rsh PLM that starts the parallel job under
SGE, the rsh PLM can detect whether the Open MPI job was started under
the SGE Parallel Environment (by checking some SGE environment
variables) and use the "qrsh -inherit" command to launch the parallel
job the same way as before. You can verify this by adding an MCA
parameter such as "--mca plm_base_verbose 10" to your mpirun command
and looking at the launch commands that mpirun uses.
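For example (a minimal sketch; the parallel environment name "orte",
the slot count and the application name below are just illustrative
placeholders), running something like this inside an SGE parallel job
should show in the verbose output whether mpirun launches its daemons
via "qrsh -inherit" or falls back to plain rsh/ssh:

  # inside a job script submitted with e.g. "qsub -pe orte 4 job.sh"
  mpirun --mca plm_base_verbose 10 -np 4 ./my_mpi_app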
It looks like the shepherd cannot be started, for a reason I haven't been able to figure out yet.
/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745] ----------------------------
Regards,
Romaric
How recent is the build you used to generate the error above? I
assume you are using a trunk build?
I haven't seen the complete error messages that you are seeing, but I
think I am running into the exact same error. It is a strange error
that reports the 'ssh' component as not found. There shouldn't be a
component named 'ssh' here, because ssh and rsh share the same
component.
Well, it looks like something is broken in the plm that is responsible
for launching tight-integration jobs under SGE.
I checked that it used to work without problems with my earlier trunk
build (r18645). I'll have to find out what has happened since...
Starting server daemon at host "burl-ct-v440-4"
Server daemon successfully started with task id "1.burl-ct-v440-4"
Establishing /opt/sge/utilbin/sol-sparc64/rsh session to host
burl-ct-v440-4 ...
[burl-ct-v440-4:13749] mca: base: components_open: Looking for plm
components
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: burl-ct-v440-4
Framework: plm
Component: ssh
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file
base/ess_base_std_orted.c at line 70
[burl-ct-v440-4:13749]
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_plm_base_open failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file
ess_env_module.c at line 135
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 132
[burl-ct-v440-4:13749]
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_set_name failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file
orted/orted_main.c at line 311
/opt/sge/utilbin/sol-sparc64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[burl-ct-v440-5:09789]
--------------------------------------------------------------------------
A daemon (pid 9790) died unexpectedly with status 255 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
[burl-ct-v440-5:09789]
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
[burl-ct-v440-5:09789]
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
burl-ct-v440-4 - daemon did not report back when launched
[burl-ct-v440-5:09789] mca: base: close: component rsh closed
[burl-ct-v440-5:09789] mca: base: close: unloading component rsh
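For what it's worth, a quick way to check which plm components actually
got built and installed on a node (assuming the ompi_info from the same
Open MPI installation is first in the PATH there) is something like:

  ompi_info | grep plm

That can help rule out a stale or mismatched installation on the remote
node, which is one of the causes the help message above points at.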
--
- Pak Lui
pak....@sun.com