Pak Lui wrote:
It was fixed at one point in the trunk before v1.3 went official, but when the code was rolled from the gridengine PLM into the rsh PLM, this feature was left out because there were some lingering issues that I hadn't resolved, and I lost track of it. Sorry about that, and thanks for bringing it up. I will need to look at the issue again and reopen this ticket against v1.3:
OK, so do I have to wait for a 1.3 release to get job suspend working, or will it be back-ported to 1.2.6 or 1.2.6?
So even though it is the rsh PLM that starts the parallel job under SGE, the rsh PLM can detect whether the Open MPI job was started under the SGE Parallel Environment (by checking some SGE environment variables) and, if so, use the "qrsh -inherit" command to launch the parallel job the same way as before. You can check this by adding an MCA parameter such as "--mca plm_base_verbose 10" to your mpirun command and looking for the launch commands that mpirun uses.
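To make the detection logic above concrete, here is a minimal sketch in Python. It is not Open MPI's actual code: the variable list (SGE_ROOT, ARC, PE_HOSTFILE, JOB_ID) is an assumption about which SGE variables get checked, the helper names are hypothetical, and the ssh fallback is also an assumption.

```python
import os

# Assumption: variables SGE exports inside a parallel environment job.
# The exact set Open MPI checks may differ between versions.
SGE_ENV_VARS = ("SGE_ROOT", "ARC", "PE_HOSTFILE", "JOB_ID")


def running_under_sge(env=None):
    """Return True if every SGE marker variable is set and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in SGE_ENV_VARS)


def launch_prefix(env=None):
    """Pick the remote launch command (hypothetical helper)."""
    if running_under_sge(env):
        # SGE's qrsh spells the option with a single dash: -inherit.
        return ["qrsh", "-inherit"]
    # Assumed fallback when no SGE environment is detected.
    return ["ssh"]


def mpirun_command(nprocs, program):
    """Build an mpirun command line with verbose PLM output enabled."""
    return ["mpirun", "--mca", "plm_base_verbose", "10",
            "-np", str(nprocs), program]
```

Running the command built by mpirun_command() inside an SGE job and searching the verbose output for "qrsh" should show which launcher was actually picked.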
It looks like the shepherd cannot be started, for a reason I haven't been able to determine yet.

/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745] ----------------------------

Regards,
Romaric
--
--------------------------------------
R. David - da...@icps.u-strasbg.fr
Tel. : 03 90 24 45 48 (Fax 45 47)
--------------------------------------