Romaric David wrote:
Pak Lui wrote:
It was fixed at one point in the trunk before v1.3 went official, but
while rolling the gridengine PLM code into the rsh PLM, this feature
was left out because there were some lingering issues that I hadn't
resolved, and I lost track of it. Sorry about that, but thanks for
bringing it up; I will need to look at the issue again and reopen this
ticket against v1.3:
OK, so do I have to wait for a 1.3 version to get job suspend working,
or will it be back-ported to 1.2.6?
I believe it will definitely be in the 1.3 series; I am not sure about
v1.2 at this point.
So even though it is the rsh PLM that starts the parallel job under
SGE, the rsh PLM can detect whether the Open MPI job was started under
the SGE Parallel Environment (by checking some SGE environment
variables) and use the "qrsh -inherit" command to launch the parallel
job the same way as before. You can verify this by adding an MCA
parameter such as "--mca plm_base_verbose 10" to your mpirun command
and looking at the launch commands that mpirun uses.
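For example (a minimal sketch; the parallel environment name "orte",
the slot count and the application name below are just illustrative
placeholders), running something like this inside an SGE parallel job
should show in the verbose output whether mpirun launches its daemons
via "qrsh -inherit" or falls back to plain rsh/ssh:

  # inside a job script submitted with e.g. "qsub -pe orte 4 job.sh"
  mpirun --mca plm_base_verbose 10 -np 4 ./my_mpi_app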
It looks like the shepherd cannot be started, for a reason I haven't been able to figure out yet.
/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745] ----------------------------
Regards,
Romaric
How recent is the build you used to generate the error above? I
assume you are using a trunk build?
I haven't seen the complete error messages that you are seeing, but I
think I am running into the exact same error. It is a strange error
that reports the 'ssh' component as not found. There shouldn't be a
component named 'ssh' here, because ssh and rsh share the same
component.
Well, it looks like something is broken in the plm that is responsible
for launching tight-integration jobs under SGE.
I checked that it used to work without problems with my earlier trunk
build (r18645). I'll have to find out what has happened since...
Starting server daemon at host "burl-ct-v440-4"
Server daemon successfully started with task id "1.burl-ct-v440-4"
Establishing /opt/sge/utilbin/sol-sparc64/rsh session to host
burl-ct-v440-4 ...
[burl-ct-v440-4:13749] mca: base: components_open: Looking for plm
components
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: burl-ct-v440-4
Framework: plm
Component: ssh
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file
base/ess_base_std_orted.c at line 70
[burl-ct-v440-4:13749]
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_plm_base_open failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file
ess_env_module.c at line 135
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 132
[burl-ct-v440-4:13749]
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_set_name failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file
orted/orted_main.c at line 311
/opt/sge/utilbin/sol-sparc64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[burl-ct-v440-5:09789]
--------------------------------------------------------------------------
A daemon (pid 9790) died unexpectedly with status 255 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
[burl-ct-v440-5:09789]
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
[burl-ct-v440-5:09789]
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
burl-ct-v440-4 - daemon did not report back when launched
[burl-ct-v440-5:09789] mca: base: close: component rsh closed
[burl-ct-v440-5:09789] mca: base: close: unloading component rsh
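For what it's worth, a quick way to check which plm components actually
got built and installed on a node (assuming the ompi_info from the same
Open MPI installation is first in the PATH there) is something like:

  ompi_info | grep plm

That can help rule out a stale or mismatched installation on the remote
node, which is one of the causes the help message above points at.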
--
- Pak Lui
pak....@sun.com