Reuti wrote:
Hi,
On 07.07.2008 at 11:31, Romaric David wrote:
Pak Lui wrote:
It was fixed at one point in the trunk before v1.3 went official, but
while rolling the code from the gridengine PLM into the rsh PLM code,
this feature was left out because there were some lingering issues
that I hadn't resolved, and I lost track of it. Sorry, but thanks for
bringing it up. I will need to look at the issue again and reopen
this ticket against v1.3:
Ok, so I have to wait for a 1.3 version to work with job suspend, or
will it be back-ported to 1.2.6 or 1.2.6 ?
So even if it is the rsh PLM that starts the parallel job under SGE, the
rsh PLM can detect if the Open MPI job is started under the SGE
Parallel Environment (by checking some SGE env vars) and use the
"qrsh -inherit" command to launch the parallel job the same way as
it did before. You can check by adding an MCA parameter such as "--mca
plm_base_verbose 10" to your mpirun command and looking for the launch
commands that mpirun uses.
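
As a rough illustration of the detection described above, here is a minimal Python sketch (not Open MPI's actual C code). The variable names SGE_ROOT, ARC, PE_HOSTFILE and JOB_ID are an assumption about which SGE env vars get checked, since the message only says "some SGE env vars". The function names and the qrsh location under $SGE_ROOT/bin/$ARC are likewise hypothetical.

```python
import os
import posixpath

# Assumption: the set of SGE variables that signals "running inside an SGE
# Parallel Environment". The thread does not name them.
SGE_ENV_VARS = ("SGE_ROOT", "ARC", "PE_HOSTFILE", "JOB_ID")


def running_under_sge(env=None):
    """Return True if every assumed SGE variable is set and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in SGE_ENV_VARS)


def choose_launch_agent(env=None):
    """Pick the remote launch command as an argv list (hypothetical helper).

    Under SGE: qrsh with its -inherit option, so the remote processes stay
    under SGE's control (tight integration). Otherwise: plain rsh.
    """
    env = os.environ if env is None else env
    if running_under_sge(env):
        # Assumed layout: $SGE_ROOT/bin/$ARC/qrsh
        qrsh = posixpath.join(env["SGE_ROOT"], "bin", env["ARC"], "qrsh")
        return [qrsh, "-inherit"]
    return ["rsh"]


if __name__ == "__main__":
    # Run inside an SGE job script to see which agent would be chosen.
    print(" ".join(choose_launch_agent()))
```

Running this inside a job script alongside the verbose mpirun output gives a quick way to compare what the environment suggests with the launch command mpirun actually uses.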
It looks like the shepherd cannot be started, for a reason I haven't been able to determine yet.
/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745] ----------------------------
You mean with the plain rsh startup, like a loose integration? Isn't a
proper hostlist necessary in this case, which for other MPI
implementations is built in the routine defined by start_proc_args? AFAIK
you can disregard the hostlist only with Open MPI's tight SGE support.
I think he's using the tight integration and not using a plain rsh
startup. The output shows that he's using the bundled rsh from
SGE. From my run with a recent trunk, something is indeed broken for
tight integration. I am looking at it now.
-- Reuti
--
- Pak Lui
pak....@sun.com