On 14.10.2014 at 19:41, patrick wrote:

> It is on a shared file space with RW to root.

It's better to have the spool directory local on each node:

https://arc.liv.ac.uk/SGE/howto/nfsreduce.html
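
As a quick sanity check on whichever directory execd_spool_dir points at (the path below is only a placeholder; take the real one from `qconf -sconf`), a small shell sketch can tell you whether the spool area is network-backed and writable:

```shell
#!/bin/sh
# Sketch: check an execd spool directory. The default path here is an
# assumption for illustration; pass the real execd_spool_dir as $1.
SPOOL_DIR="${1:-/tmp}"

# Filesystem type: "nfs" (or similar) means job spooling goes over the network.
fstype=$(df -PT "$SPOOL_DIR" | awk 'NR==2 {print $2}')
echo "filesystem type: $fstype"

# A full or read-only spool area stops sge_execd from writing job state.
if [ -w "$SPOOL_DIR" ]; then
    echo "$SPOOL_DIR is writable"
else
    echo "$SPOOL_DIR is NOT writable"
fi
```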

-- Reuti


> I get this message in the node's messages file after the node was rebooted.
> 
> 
> 10/14/2014 11:28:16|  main|n72|E|removing unreferenced job 214320.247 without 
> job report from ptf
> 10/14/2014 11:28:53|  main|n72|W|reaping job "214320" ptf complains: Job does 
> not exist
> 
> On Tue, Oct 14, 2014 at 1:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> On 14.10.2014 at 19:13, patrick wrote:
> 
> > Just in the qmaster messages. It will give an error such as:
> >
> > 10/14/2014 11:05:07| timer|hn1|W|failed to deliver job 214320.423 to queue 
> > "all.q@n72"
> 
> And nothing on the node? Is the spooling directory full or write-protected? Is it by 
> any chance on a shared file space, or local on each machine?
> 
> -- Reuti
> 
> 
> > On Tue, Oct 14, 2014 at 12:56 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> > On 14.10.2014 at 18:30, patrick wrote:
> >
> > > No, the jobs will stay in 't' status and not run. Sometimes after the reboot 
> > > they will change from 't' to 'r', and sometimes they will stay in 't' 
> > > until deleted and resubmitted.
> >
> > Aha, that's strange. Is there anything in the message files of the qmaster or the 
> > exechost referring to the <job_id>s in question?
> >
> > -- Reuti
> >
> >
> > > Thanks!
> > >
> > > On Tue, Oct 14, 2014 at 12:07 PM, Reuti <re...@staff.uni-marburg.de> 
> > > wrote:
> > > Hiho,
> > >
> > > On 14.10.2014 at 17:10, patrick wrote:
> > >
> > > > Over the past couple of months we have run into issues with jobs on 
> > > > random nodes staying in a 't' status. The only way to resolve it is to 
> > > > restart the node, which frustrates users who run array and MPI jobs. 
> > > > I am not seeing anything in the logs to indicate an issue. 
> > > > We are using the Berkeley database, and I was wondering if that could be 
> > > > causing the issue? As in, does some maintenance need to be done on it to 
> > > > keep it running smoothly?
> > >
> > > But essentially the jobs ran fine - so it's more of a cosmetic issue?
> > >
> > > -- Reuti
> > >
> >
> >
> 
> 


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
