Reuti wrote:
Hi,

On 07.07.2008 at 11:31, Romaric David wrote:

Pak Lui wrote:
It was fixed at one point in the trunk before v1.3 went official, but while rolling the code from the gridengine PLM into the rsh PLM code, this feature was left out because there were some lingering issues that I didn't resolve, and I lost track of it. Sorry, but thanks for bringing it up. I will need to look at the issue again and reopen this ticket against v1.3:
Ok, so I have to wait for a 1.3 version to work with job suspend, or
will it be back-ported to 1.2.6 or 1.2.6 ?

So even though it is the rsh PLM that starts the parallel job under SGE, the rsh PLM can detect whether the Open MPI job was started under the SGE Parallel Environment (by checking some SGE env vars) and use the "qrsh -inherit" command to launch the parallel job the same way as before. You can check by setting an MCA parameter such as "--mca plm_base_verbose 10" in your mpirun command and looking for the launch commands that mpirun uses.
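A minimal sketch of the detection logic described above, in Python for illustration only. It is not Open MPI's actual code: the exact set of environment variables checked (SGE_ROOT, ARC, PE_HOSTFILE, JOB_ID) is an assumption, and the helper names running_under_sge_pe and build_launch_command, the host "node01", and the remote command "orted" are hypothetical.

```python
import os

# Assumption: variables that SGE exports inside a parallel environment job.
# The exact set Open MPI checks may differ.
SGE_PE_VARS = ("SGE_ROOT", "ARC", "PE_HOSTFILE", "JOB_ID")


def running_under_sge_pe(env=None):
    """Return True if every assumed SGE variable is set and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in SGE_PE_VARS)


def build_launch_command(env, host, remote_cmd):
    """Pick the launcher: qrsh -inherit under an SGE PE, plain rsh otherwise."""
    if running_under_sge_pe(env):
        # Tight integration: the remote process is started under SGE's control.
        return ["qrsh", "-inherit", host] + list(remote_cmd)
    # Loose integration: fall back to plain rsh.
    return ["rsh", host] + list(remote_cmd)


if __name__ == "__main__":
    # Hypothetical host and remote command, for illustration only.
    print(build_launch_command(os.environ, "node01", ["orted"]))
```

Run outside an SGE job, the sketch falls back to the rsh form. Inside a parallel environment job, where the assumed variables are set, it selects qrsh -inherit, which is the launch command to look for in the plm_base_verbose output.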
It looks like the shepherd cannot be started, for a reason I haven't figured out yet.
/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745] ----------------------------

Do you mean with the plain rsh startup, like a loose integration? Isn't a proper hostlist necessary in this case, which for other MPI implementations is built in the routine defined by start_proc_args? AFAIK you can disregard the hostlist only with Open MPI's tight SGE support.

I think he's using the tight integration and not a plain rsh startup. The output shows that he's using the bundled rsh from SGE. From my run with a recent trunk, something is indeed broken for tight integration. I am looking at it now.


-- Reuti
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--

- Pak Lui
pak....@sun.com
