On 30.12.2011 at 09:56, Semion Chernin wrote:

> 1) First of all, you're right: not all execution hosts were also defined as 
> submit hosts.
> I fixed that, but the problem did not happen only on these nodes.
> 
> 2) Can you explain how I can implement a ckpt environment and use it to 
> replace the suspend method in intel_all.q?
> Does every submission have to be of the form qsub -ckpt, or can the grid do it 
> automatically,

For now I assume you select the queue by some means anyway. So instead of a 
queue you could request the checkpointing environment, which is attached only 
to intel_all.q. It's of course also possible to attach it in a JSV, depending 
on certain flags.

> without any effort on the user's side?

Did you check the references in the link I posted:

man sge_chkpt
man checkpoint

http://arc.liv.ac.uk/SGE/howto/checkpointing.html
http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.pdf (nice state diagrams)

Specifically, a "Userdefined interface" as described in the Howto will be most 
appropriate. The Howto also shows the effect on the list of jobs.
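
As a rough sketch of what such a setup might look like: the script below writes a checkpointing environment definition in the format described in `man checkpoint` and, only if `qconf` is found, registers it and attaches it to intel_all.q. The name "resubmit", the file location under /tmp and all the "none" entries are assumptions for illustration. Whether migr_command needs a real script for your jobs depends on your setup, so please compare with the Howto before using it.

```shell
#!/bin/sh
# Sketch only: the name "resubmit" and the file path are assumptions.
CKPT_NAME=resubmit
CKPT_FILE=${CKPT_FILE:-/tmp/$CKPT_NAME.ckpt}

# Definition in the format shown by `qconf -sckpt`.
# "when x" asks for migration when the job gets suspended.
cat >"$CKPT_FILE" <<EOF
ckpt_name          $CKPT_NAME
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             none
when               x
EOF

if command -v qconf >/dev/null 2>&1; then
    # Register the environment, then add it to the queue's ckpt_list.
    qconf -Ackpt "$CKPT_FILE"
    qconf -aattr queue ckpt_list "$CKPT_NAME" intel_all.q
else
    echo "qconf not found; definition left in $CKPT_FILE"
fi
```

With this in place a job would be submitted with something like `qsub -ckpt resubmit job.sh` instead of requesting the queue, and the suspend_method would be removed from intel_all.q.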

-- Reuti


> ----- Original Message -----
> From: Reuti <[email protected]>
> Date: Thursday, December 29, 2011 20:53
> Subject: Re: [gridengine users] Sometimes resubmission doesn't work
> To: Semi <[email protected]>
> Cc: "[email protected]" <[email protected]>
> 
> > Hi,
> > 
> > On 29.12.2011 at 19:15, Semi wrote:
> > 
> > > 
> > > I have a public queue intel_all.q and a private queue namd.q with
> > > subordinate_list      intel_all.q=1
> > > 
> > > Some nodes of namd.q are also included in intel_all.q, which has
> > > suspend_method        /storage/Scripts/job_resubmit.sh $job_id
> > > 
> > > cat /storage/Scripts/job_resubmit.sh
> > > #!/bin/sh
> > > /storage/SGE/bin/lx24-amd64/qresub $1
> > > /storage/SGE/bin/lx24-amd64/qdel $1
> > 
> > Is it happening only on certain hosts? In this configuration all 
> > exechosts also need to be submit hosts. The other annoyance I 
> > see is that the resubmitted jobs are pushed to the end of the 
> > waiting jobs again.
> >  
> > 
> > > When even one job is submitted to the private queue, the public jobs have 
> > > to be resubmitted and killed.
> > > Sometimes this doesn't work and they get status S (suspended)
> > > 
> > > sge143                  lx24-amd64     24 45.65   47.3G   30.4G   48.0G     0.0
> > >    namd.q               BIP   24/24
> > >    intel_all.q          BIP   23/24    S
> > > 
> > > 5219266 0.50511 SemanticEx alexla       S     12/29/2011 16:34:08 intel_all.q@sge143                 1
> > 
> > Maybe the `qdel` didn't succeed. You can check the messages 
> > files of the qmaster and the exechost to see whether it was issued 
> > and executed. If a job isn't deleted or stopped by a signal, it 
> > will continue running, as you observe right now.
> > 
> > I would suggest removing the suspend_method and defining a 
> > checkpointing interface which is attached to intel_all.q. To 
> > reach this queue it's then sufficient to request the 
> > checkpointing interface. When the checkpointing interface is 
> > set up to migrate on suspend, the job (still with the same 
> > job number) will be requeued automatically.
> > 
> > http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/2193
> > 
> > -- Reuti
> > 
> > 
> > > and are still actually running and taking resources of the node (CPU 
> > > & memory).
> > > How can I solve this problem?
> > > 
> > > 
> > > 
> > > _______________________________________________
> > > users mailing list
> > > [email protected]
> > > https://gridengine.org/mailman/listinfo/users
> > 
> >

