1) First of all, you're right: not all execution hosts were also defined as submit hosts. I have fixed that, but the problem did not occur only on those nodes.
2) Can you explain how I can implement a ckpt (checkpointing) environment to replace the suspend_method in intel_all.q? Does every submission have to use the form qsub -ckpt, or can Grid Engine do it automatically, without any effort on the user's side?

----- Original Message -----
From: Reuti <[email protected]>
Date: Thursday, December 29, 2011 20:53
Subject: Re: [gridengine users] Sometimes resubmission doesn't work
To: Semi <[email protected]>
Cc: "[email protected]" <[email protected]>

> Hi,
>
> On 29.12.2011 at 19:15, Semi wrote:
>
> > I have a public queue intel_all.q and a private queue namd.q with
> > subordinate_list intel_all.q=1
> >
> > Some nodes of namd.q are also included in intel_all.q, which has
> > suspend_method /storage/Scripts/job_resubmit.sh $job_id
> >
> > cat /storage/Scripts/job_resubmit.sh
> > #!/bin/sh
> > /storage/SGE/bin/lx24-amd64/qresub $1
> > /storage/SGE/bin/lx24-amd64/qdel $1
>
> Is it happening only on certain hosts? In this configuration all
> exechosts also need to be submit hosts. The other annoyance I
> see is that the resubmitted jobs are pushed to the end of the
> waiting jobs again.
>
> > When even 1 job is submitted to the private queue, the public jobs
> > have to be resubmitted and killed.
> > Sometimes it doesn't work, and they get status S (suspended)
> >
> > sge143 lx24-amd64 24 45.65 47.3G 30.4G 48.0G 0.0
> >    namd.q        BIP 24/24
> >    intel_all.q   BIP 23/24 S
> >
> > 5219266 0.50511 SemanticEx alexla S 12/29/2011 16:34:08 intel_all.q@sge143 1
>
> Maybe the `qdel` didn't succeed. You can check the messages
> files of the qmaster and the exechost to see whether it was issued
> and executed. If the jobs aren't deleted or stopped by a signal, they
> will continue to run, as you are observing right now.
>
> I would suggest removing the suspend_method and defining a
> checkpointing interface which is attached to intel_all.q. To
> reach this queue it's then sufficient to request the
> checkpointing interface. When the checkpointing interface is
> set up to migrate on suspend, the job (still with the same
> job number) will be requeued automatically.
>
> http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/2193
>
> -- Reuti
>
> > and they are still actually running and taking resources of the
> > node (CPU & memory).
> > How can I solve this problem?
> >
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
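
A minimal sketch of the kind of checkpointing environment suggested in the quoted reply, for illustration only. The name migrate_on_suspend and the ckpt_dir value are hypothetical. The field layout follows what `qconf -sckpt` prints, and the `x` in `when` asks Grid Engine to act when the job gets suspended. Whether the job is really requeued also depends on the interface type and on how the job reacts, so treat this as an assumption to check against checkpoint(5), not as a tested configuration.

```
# Annotation only: hypothetical checkpointing environment, not taken from the thread.
# Strip these comment lines if your qconf version does not accept them.
ckpt_name          migrate_on_suspend
interface          APPLICATION-LEVEL
ckpt_command       NONE
migr_command       NONE
restart_command    NONE
clean_command      NONE
ckpt_dir           /tmp
signal             NONE
when               x
```

Under these assumptions, the definition would be loaded with `qconf -Ackpt <file>`, the name would be added to the ckpt_list of intel_all.q (for example via `qconf -mq intel_all.q`), and jobs would request it with `qsub -ckpt migrate_on_suspend`. The request is normally made per job; a cluster-wide sge_request default file is one possible way to spare users from typing it, though nothing in this thread confirms that approach.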
