Hello again. Since I already had a 6.1 version of SGE, I reverted to it
and included the hacks (ssh, sshd -i and qlogin_wrap), and in this way
both the interactive qsh and qrsh and batch qsub worked with Open MPI.
For me this is a solution, but I'm still curious about what was going
on in 6.2. I will see if there is a list like this for SGE.

Thanks a lot.

--
Jaime Perea
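[For reference, the 6.1-era "hacks" mentioned here are the ssh startup
settings from the SGE tight-ssh-integration howto, made in the global
cluster configuration. A minimal sketch, assuming standard binary
locations; the qlogin_wrapper path is an example, not a fixed location:

    # qconf -mconf  (global cluster configuration, 6.1-style ssh startup)
    qlogin_command   /usr/local/sge/bin/qlogin_wrapper  # example path to the howto's wrapper script
    qlogin_daemon    /usr/sbin/sshd -i
    rlogin_command   /usr/bin/ssh
    rlogin_daemon    /usr/sbin/sshd -i
    rsh_command      /usr/bin/ssh
    rsh_daemon       /usr/sbin/sshd -i

Running sshd with -i lets the SGE execution daemon start it per job
instead of relying on a system-wide sshd, which is what keeps the
remote processes under SGE's control.]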
On Thursday, 2 October 2008, Rolf Vandevaart wrote:

> On 10/02/08 11:18, Reuti wrote:
> > On 02.10.2008 at 16:51, Jaime Perea wrote:
> >> Hi
> >>
> >> builtin, do I have to change them to ssh and sshd as in SGE 6.1?
> >
> > I always used only rsh, as ssh doesn't provide a Tight Integration
> > with correct accounting (unless you compiled SGE with -tight-ssh on
> > your own).
> >
> > But it would be worth a try with either the rsh or ssh stuff, as the
> > builtin starter is a new feature of SGE 6.2.
> >
> > -- Reuti
>
> As was mentioned, SGE 6.2 has a new integrated job starter, so rsh
> and ssh do not need to be used to start jobs on remote nodes. This is
> the recommended way of starting jobs, as it is faster than ssh and
> more scalable than rsh. And you do not need to do any hacks for proper
> job accounting, as were needed for ssh.
>
> Under the covers, Open MPI uses qrsh to start the MPI jobs on all the
> nodes.
>
> Not sure if that helps, but I just wanted to mention that information.
>
> Rolf
>
> >> Thanks again
> >>
> >> --
> >> Jaime Perea
> >>
> >> On Thursday, 2 October 2008, Reuti wrote:
> >>> On 02.10.2008 at 16:12, Jaime Perea wrote:
> >>>> Hi again, thanks for the answer.
> >>>>
> >>>> Actually I took the definition of the PE from the Open MPI
> >>>> web page; in my case:
> >>>>
> >>>> qconf -sp orte
> >>>> pe_name            orte
> >>>> slots              24
> >>>> user_lists         NONE
> >>>> xuser_lists        NONE
> >>>> start_proc_args    /bin/true
> >>>> stop_proc_args     /bin/true
> >>>> allocation_rule    $round_robin
> >>>> control_slaves     TRUE
> >>>> job_is_first_task  TRUE
> >>>> urgency_slots      min
> >>>> accounting_summary FALSE
> >>>>
> >>>> Our SGE is version 6.2, and Open MPI was configured with
> >>>> the --with-sge switch, of course.
> >>>
> >>> In SGE 6.2 two types of remote startup are implemented. Which one
> >>> are you using in the SGE configuration: builtin, or the former
> >>> settings for each command?
> >>>
> >>> -- Reuti
> >>>
> >>>> Regards
> >>>>
> >>>> --
> >>>> Jaime Perea
> >>>>
> >>>> On Thursday, 2 October 2008, Reuti wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On 02.10.2008 at 15:37, Jaime Perea wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> I am having some problems with a combination of Open MPI and
> >>>>>> SGE 6.2.
> >>>>>>
> >>>>>> Currently I'm working with the 1.3a1r19666 Open MPI release
> >>>>>> and the
> >>>>>
> >>>>> AFAIK, you have to enable SGE support in Open MPI 1.3 during its
> >>>>> compilation.
> >>>>>
> >>>>>> Myrinet gm libraries (2.1.19), but the problem was the same
> >>>>>> with the prior 1.3 version. In short, I'm able to send jobs to
> >>>>>> a queue via qrsh, more or less this way:
> >>>>>>
> >>>>>> qrsh -cwd -V -q para -pe orte 6 mpirun -np 6 ctiming
> >>>>>
> >>>>> It should also work without specifying the number of slots a
> >>>>> second time, i.e.:
> >>>>>
> >>>>> qrsh -cwd -V -q para -pe orte 6 mpirun ctiming
> >>>>>
> >>>>>> ctiming is a small test program, and in this way it works, but
> >>>>>> if I try to send the same task by using qsub on a script like
> >>>>>> this one:
> >>>>>>
> >>>>>> #!/bin/sh
> >>>>>> #$ -pe orte 6
> >>>>>
> >>>>> This PE has just /bin/true for start-/stop_proc_args?
> >>>>>
> >>>>>> #$ -q para
> >>>>>> #$ -cwd
> >>>>>> #
> >>>>>> mpirun -np $NSLOTS /model/jaime/ctiming
> >>>>>
> >>>>> mpirun /model/jaime/ctiming
> >>>>>
> >>>>>> it fails with a message like this:
> >>>>>> ..............
> >>>>>>
> >>>>>> error reading job context from "qlogin_starter"
> >>>>>
> >>>>> qlogin_starter should of course only be started with a qlogin
> >>>>> command in SGE.
> >>>>>
> >>>>>> --------------------------------------------------------------------------
> >>>>>> A daemon (pid 11207) died unexpectedly with status 1 while
> >>>>>> attempting to launch so we are aborting.
> >>>>>>
> >>>>>> There may be more information reported by the environment (see
> >>>>>> above).
> >>>>>>
> >>>>>> This may be because the daemon was unable to find all the
> >>>>>> needed shared libraries on the remote node. You may set your
> >>>>>> LD_LIBRARY_PATH to have the location of the shared libraries on
> >>>>>> the remote nodes and this will automatically be forwarded to
> >>>>>> the remote nodes.
> >>>>>>
> >>>>>> .............
> >>>>>>
> >>>>>> I know that LD_LIBRARY_PATH is not the problem, since I checked
> >>>>>> that all the environment is present... any idea?
> >>>>>>
> >>>>>> For previous releases of SGE and Open MPI I was able to make
> >>>>>> them work together with a few wrappers,
> >>>>>
> >>>>> Which version of SGE are you using?
> >>>>>
> >>>>> -- Reuti
> >>>>>
> >>>>>> but now the integration looks much better!
> >>>>>> This happens only when sending Open MPI jobs.
> >>>>>>
> >>>>>> Thanks and all the best
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>> Jaime D. Perea Duarte. <jaime at iaa dot es>
> >>>>>> Linux registered user #10472
> >>>>>>
> >>>>>> Dep. Astrofisica Extragalactica.
> >>>>>> Instituto de Astrofisica de Andalucia (CSIC)
> >>>>>> Apdo. 3004, 18080 Granada, Spain.
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> users mailing list
> >>>>>> us...@open-mpi.org
> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
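[A quick way to check both sides of the integration discussed in this
thread, assuming ompi_info and qconf are in the submit host's PATH; the
exact output varies by site and version:

    # Was Open MPI built with SGE support? gridengine components
    # should be listed in the output.
    ompi_info | grep gridengine

    # How does SGE start remote processes? SGE 6.2 defaults to
    # "builtin"; a 6.1-style setup shows the ssh/sshd entries instead.
    qconf -sconf | grep -E '(qlogin|rlogin|rsh)_(command|daemon)'

If ompi_info lists no gridengine components, Open MPI was configured
without --with-sge and will not use qrsh to launch on the nodes.]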