Hi,

On 29.12.2011 at 19:15, Semi wrote:

> I have public queue intel_all.q and private queue namd.q with
> subordinate_list intel_all.q=1
>
> Some nodes of namd.q included in intel_all.q, that have
> suspend_method /storage/Scripts/job_resubmit.sh $job_id
>
> cat /storage/Scripts/job_resubmit.sh
> #!/bin/sh
> /storage/SGE/bin/lx24-amd64/qresub $1
> /storage/SGE/bin/lx24-amd64/qdel $1

Is it happening only on certain hosts? In this configuration all exechosts also need to be submit hosts, because the suspend_method runs `qresub` on the execution host. The other annoyance I see is that the resubmitted jobs are pushed to the end of the list of waiting jobs again.

> When even 1 job from private queue submitted, public jobs have to be
> resubmitted and killed.
> Sometimes it doesn't work, they got status S (suspend)
> sge143 lx24-amd64 24 45.65 47.3G 30.4G 48.0G
> 0.0
> namd.q BIP 24/24
> intel_all.q BIP 23/24 S
>
> 5219266 0.50511 SemanticEx alexla S 12/29/2011 16:34:08
> intel_all.q@sge143 1

Maybe the `qdel` didn't succeed. You can check the messages files of the qmaster and the exechost to see whether it was issued and executed. If a job isn't deleted or stopped by a signal, it will continue to run, which is what you observe right now.

I would suggest removing the suspend_method and defining a checkpointing interface that is attached to intel_all.q. To reach this queue it is then sufficient to request the checkpointing interface. When the checkpointing interface is set up to migrate on suspend, the job (still with the same job number) will be requeued automatically.

http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/2193

-- Reuti

> and still actually running and take resources of the node (CPU & memory).
> How I can solve this problem?
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
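
As a rough illustration of the checkpointing interface suggested above, a definition along the following lines could be a starting point. This is only a sketch under assumptions: the name `requeue_on_suspend` and the `/tmp` directory are made up for the example and do not come from the thread, and the fields follow the checkpoint(5) format as I remember it, so please compare with `man checkpoint` on your installation before using it. The relevant part is `when x`, which asks Grid Engine to abort and requeue the job as soon as it gets suspended.

```
# hypothetical checkpointing environment, e.g. created with: qconf -ackpt requeue_on_suspend
ckpt_name          requeue_on_suspend
interface          APPLICATION-LEVEL
ckpt_command       NONE
migr_command       NONE
restart_command    NONE
clean_command      NONE
ckpt_dir           /tmp
signal             NONE
# x = migrate (abort and requeue) the job when it gets suspended
when               x
```

It would then be added to the `ckpt_list` of intel_all.q (for example via `qconf -mq intel_all.q`) and requested at submission time with `qsub -ckpt requeue_on_suspend ...`. Since nothing is actually checkpointed here, the job simply starts from the beginning again after being requeued.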
