Hi,
On 07.07.2008 at 11:31, Romaric David wrote:
Pak Lui wrote:
It was fixed at one point in the trunk before v1.3 went official,
but while rolling the code from gridengine PLM into the rsh PLM
code, this feature was left out because there were some lingering
issues that I didn't resolve, and I lost track of it. Sorry, but
thanks for bringing it up, I will need to look at the issue again
and reopen this ticket against v1.3:
Ok, so I have to wait for a 1.3 version to work with job suspend, or
will it be back-ported to 1.2.6 or 1.2.6 ?
So even though it is the rsh PLM that starts the parallel job under SGE,
the rsh PLM can detect if the Open MPI job is started under the
SGE Parallel Environment (via checking some SGE env vars) and use
the "qrsh -inherit" command to launch the parallel job the same
way as before. You can check this by adding an MCA parameter such
as "--mca plm_base_verbose 10" to your mpirun command and looking
for the launch commands that mpirun uses.
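
To illustrate the detection described above, here is a rough sketch in Python (not Open MPI's actual C code). The set of variables checked (SGE_ROOT, ARC, PE_HOSTFILE, JOB_ID) is an assumption about what "some SGE env vars" means, and both function names are hypothetical. Note that qrsh spells its option with a single dash.

```python
import os

# Assumed set of variables that SGE exports inside a parallel environment job.
SGE_PE_VARS = ("SGE_ROOT", "ARC", "PE_HOSTFILE", "JOB_ID")


def running_under_sge_pe(env=None):
    """Return True if every assumed SGE variable is set and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in SGE_PE_VARS)


def pick_launch_agent(env=None):
    """Choose the remote launch command prefix as a list of arguments."""
    if running_under_sge_pe(env):
        # Tight integration: let SGE start (and control) the remote processes.
        return ["qrsh", "-inherit"]
    # Outside an SGE parallel environment, fall back to plain rsh.
    return ["rsh"]
```

Calling pick_launch_agent() inside an SGE job script should select the qrsh prefix, while the same call in an ordinary login shell falls back to rsh.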
It looks like the shepherd cannot be started, for a reason I haven't
been able to determine yet.
/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745] ----------------------------
Do you mean with the plain rsh startup, like a loose integration? Isn't
a proper hostlist necessary in this case, which for other MPI
implementations is built in the routine defined by start_proc_args? AFAIK
you can disregard the hostlist only with Open MPI's tight SGE support.
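
For a loose integration, a start_proc_args routine typically turns the file named by $PE_HOSTFILE into a hostlist. A minimal sketch of that step in Python, assuming the usual "host slots queue processor-range" line layout; the function name and the host names in the usage note are hypothetical:

```python
def hosts_from_pe_hostfile(text):
    """Expand '<host> <slots> ...' lines into one host entry per slot."""
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        hosts.extend([host] * slots)
    return hosts
```

Given the two lines "node01 2 all.q@node01 UNDEFINED" and "node02 1 all.q@node02 UNDEFINED", this yields node01 twice followed by node02, which can then be written out as a machine file.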
-- Reuti