1) First of all, you're right: not all execution hosts were defined as
submit hosts as well.
I fixed that, but the problem did not occur only on those nodes.

2) Can you explain how I can set up a checkpointing (ckpt) environment
and use it in place of the suspend_method in intel_all.q?
Does every job have to be submitted with qsub -ckpt, or can Grid Engine
do this automatically, without any effort on the user's side?
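
For reference, here is a minimal sketch of what the checkpointing setup
suggested in the quoted reply below might look like. It is untested on
this cluster and rests on assumptions: the environment name "resub" and
the job script "job.sh" are hypothetical, the field layout follows the
checkpoint(5) man page, and a job that does not write its own
checkpoints would simply start again from the beginning after being
requeued. The script only prints the qconf/qsub commands so they can be
reviewed first; it does not run them.

```sh
#!/bin/sh
# Sketch only: "resub" is a hypothetical checkpoint environment name.
CKPT_NAME=resub
QUEUE=intel_all.q
CKPT_FILE=$(mktemp)

# Checkpoint environment definition in the checkpoint(5) file format.
# "when xs": abort and requeue the job when it gets suspended (x) or
# when the execution daemon on its host is shut down (s).
cat > "$CKPT_FILE" <<EOF
ckpt_name          $CKPT_NAME
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             none
when               xs
EOF

# Print the commands instead of executing them.
echo "qconf -Ackpt $CKPT_FILE"                         # add the environment from the file
echo "qconf -aattr queue ckpt_list $CKPT_NAME $QUEUE"  # attach it to the public queue
echo "qconf -mattr queue suspend_method NONE $QUEUE"   # drop the resubmit script
echo "qsub -ckpt $CKPT_NAME job.sh"                    # job.sh is a placeholder job script
```

Under this sketch a job has to request the environment with -ckpt,
either on the qsub command line or through a default request file such
as sge_request; whether a site-wide default is appropriate here is an
assumption to check.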

----- Original Message -----
From: Reuti <[email protected]>
Date: Thursday, December 29, 2011 20:53
Subject: Re: [gridengine users] Sometimes resubmission doesn't work
To: Semi <[email protected]>
Cc: "[email protected]" <[email protected]>

> Hi,
> 
> On 29.12.2011 at 19:15, Semi wrote:
> 
> > 
> > I have a public queue intel_all.q and a private queue namd.q with
> > subordinate_list      intel_all.q=1
> > 
> > Some nodes of namd.q are also included in intel_all.q, which has
> > suspend_method        /storage/Scripts/job_resubmit.sh $job_id
> > 
> > cat /storage/Scripts/job_resubmit.sh
> > #!/bin/sh
> > /storage/SGE/bin/lx24-amd64/qresub $1
> > /storage/SGE/bin/lx24-amd64/qdel $1
> 
> Is it happening only on certain hosts? In this configuration all
> exechosts also need to be submit hosts. The other annoyance I
> see is that the resubmitted jobs are pushed to the end of the
> list of waiting jobs again.
>  
> 
> > When even one job is submitted to the private queue, the public jobs
> > have to be resubmitted and killed.
> > Sometimes this doesn't work and they end up in state S (suspended):
> > 
> > sge143                  lx24-amd64     24 45.65   47.3G   30.4G   48.0G     0.0
> >    namd.q               BIP   24/24
> >    intel_all.q          BIP   23/24    S
> > 
> > 5219266 0.50511 SemanticEx alexla       S     12/29/2011 16:34:08 intel_all.q@sge143                 1
> 
> Maybe the `qdel` didn't succeed. You can check the messages
> files of the qmaster and the exechost to see whether it was issued
> and executed. If the job isn't deleted or stopped by a signal, it
> will continue to run, as you are observing right now.
> 
> I would suggest removing the suspend_method and defining a
> checkpointing interface that is attached to intel_all.q. To
> reach this queue, it is then sufficient to request the
> checkpointing interface. When the checkpointing interface is
> set up to migrate on suspend, the job (still with the same
> job number) will be requeued automatically.
> 
> http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/2193
> 
> -- Reuti
> 
> 
> > and are still actually running and taking up resources on the node
> > (CPU & memory).
> > How can I solve this problem?
> > 
> > 
> > 
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
> 