On 30.12.2011 at 09:56, Semion Chernin wrote:

> 1) First of all you're right, not all execution hosts were defined as
> submit hosts too.
> I fixed it, but the problem happened not only with these nodes.
>
> 2) Can you explain how I can implement a ckpt environment and use it in place of
> the suspend method in intel_all.q?
> Does every submission have to be in the form qsub -ckpt, or can the grid do it
> automatically,

For now I assume you select the queue by some means anyway. So instead of a queue you could request the checkpointing environment, which is attached only to intel_all.q. It's of course also possible to attach it in a JSV, depending on certain flags.

> without any effort from the user side.

Did you check the references in the link I posted:

man sge_ckpt
man checkpoint
http://arc.liv.ac.uk/SGE/howto/checkpointing.html
http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.pdf (nice state diagrams)

In detail, a "Userdefined interface" according to the Howto will be most appropriate. The Howto also shows the effect on the list of jobs.

-- Reuti

> ----- Original Message -----
> From: Reuti <[email protected]>
> Date: Thursday, December 29, 2011 20:53
> Subject: Re: [gridengine users] Sometimes resubmission doesn't work
> To: Semi <[email protected]>
> Cc: "[email protected]" <[email protected]>
>
> > Hi,
> >
> > On 29.12.2011 at 19:15, Semi wrote:
> >
> > > I have a public queue intel_all.q and a private queue namd.q with
> > > subordinate_list intel_all.q=1
> > >
> > > Some nodes of namd.q are included in intel_all.q, which has
> > > suspend_method /storage/Scripts/job_resubmit.sh $job_id
> > >
> > > cat /storage/Scripts/job_resubmit.sh
> > > #!/bin/sh
> > > /storage/SGE/bin/lx24-amd64/qresub $1
> > > /storage/SGE/bin/lx24-amd64/qdel $1
> >
> > Is it happening only on certain hosts? In this configuration all
> > exechosts also need to be submit hosts. The other annoyance I
> > see is that the resubmitted jobs are pushed to the end of the
> > waiting jobs again.
> >
> > > When even 1 job from the private queue is submitted, public jobs have
> > > to be resubmitted and killed.
> > > Sometimes it doesn't work, they get status S (suspended)
> > >
> > > sge143 lx24-amd64 24 45.65 47.3G 30.4G 48.0G 0.0
> > >    namd.q BIP 24/24
> > >    intel_all.q BIP 23/24 S
> > >       5219266 0.50511 SemanticEx alexla S 12/29/2011 16:34:08 intel_all.q@sge143 1
> >
> > Maybe the `qdel` didn't succeed. You can check the messages
> > files of the qmaster and the exechost to see whether it was issued and
> > executed. If the jobs aren't deleted or stopped by a signal, they
> > will continue to run, as you observe right now.
> >
> > I would suggest removing the suspend_method and defining a
> > checkpointing interface which is attached to intel_all.q. To
> > reach this queue it's then sufficient to request the
> > checkpointing interface. When the checkpointing interface is
> > set up to migrate on suspend, the job (still with the same
> > job number) will be requeued automatically.
> >
> > http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/2193
> >
> > -- Reuti
> >
> > > and are actually still running and taking resources of the node (CPU
> > > & memory).
> > > How can I solve this problem?
> > >
> > > _______________________________________________
> > > users mailing list
> > > [email protected]
> > > https://gridengine.org/mailman/listinfo/users
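
For illustration, the setup suggested above (drop the suspend_method, attach a checkpointing environment to intel_all.q, request it at submission) might be sketched roughly as follows. This is only a sketch under assumptions: the environment name `migrate_on_suspend`, the temporary file location, the job script `job.sh` and all field values except `interface` and `when` are made up here and should be checked against `man checkpoint` and the Howto before use. By default the script only prints the `qconf`/`qsub` commands instead of running them.

```shell
#!/bin/sh
# Sketch only: the environment name, file path and job.sh are hypothetical.
CKPT_NAME=migrate_on_suspend
CKPT_FILE="${TMPDIR:-/tmp}/${CKPT_NAME}.ckpt"

# Checkpointing environment in the format described in checkpoint(5).
# "when x" asks for action when the job gets suspended; the remaining
# values are placeholders (assumptions), not a tested configuration.
cat > "$CKPT_FILE" <<EOF
ckpt_name          $CKPT_NAME
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             none
when               x
EOF

# Dry run by default: each command is echoed. Call the script with
# RUN= (empty) to really execute the commands on an admin/submit host.
RUN=${RUN-echo}

# Add the checkpointing environment from the file.
$RUN qconf -Ackpt "$CKPT_FILE"
# Attach it to intel_all.q only.
$RUN qconf -aattr queue ckpt_list "$CKPT_NAME" intel_all.q
# Remove the custom suspend_method from the queue.
$RUN qconf -mattr queue suspend_method NONE intel_all.q
# Request the environment instead of the queue at submission time.
$RUN qsub -ckpt "$CKPT_NAME" job.sh
```

Since the environment is attached only to intel_all.q, requesting it with `-ckpt` should be enough to direct the job to that queue; whether users type the option themselves or a JSV adds it is a separate choice.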
