We figured out the issue. Somehow the hosts file on the headnodes had the Ethernet interfaces removed, so only the InfiniBand interfaces were listed. This was causing communication problems in SGE. I added the Ethernet interfaces back to the hosts file a month ago and the problem hasn't come back. Just an FYI in case anyone else runs into it.
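For anyone hitting the same thing: the fix amounts to making sure /etc/hosts on the headnodes lists every node on both fabrics, since the SGE daemons talk over the Ethernet names. A minimal sketch of what the entries should look like (the hostnames and addresses below are made-up examples, not from our cluster):

```
# /etc/hosts -- each node listed on both interfaces (example addresses)
10.0.0.72      n72      # Ethernet: sge_qmaster/sge_execd communication
192.168.0.72   n72-ib   # InfiniBand: MPI traffic
```

If only the second line is present, name lookups for n72 fail on the headnode and job delivery starts misbehaving, which matches what we saw.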
On Tue, Oct 14, 2014 at 3:50 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Am 14.10.2014 um 19:41 schrieb patrick:
>
>> It is on a shared file space with RW to root.
>
> It's better to have the spool directory local on each node:
>
> https://arc.liv.ac.uk/SGE/howto/nfsreduce.html
>
> -- Reuti
>
>> Get this message in the node's messages file after the node was rebooted:
>>
>> 10/14/2014 11:28:16| main|n72|E|removing unreferenced job 214320.247 without job report from ptf
>> 10/14/2014 11:28:53| main|n72|W|reaping job "214320" ptf complains: Job does not exist
>>
>> On Tue, Oct 14, 2014 at 1:15 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>> Am 14.10.2014 um 19:13 schrieb patrick:
>>>
>>>> Just in the qmaster messages. It will give an error such as:
>>>>
>>>> 10/14/2014 11:05:07| timer|hn1|W|failed to deliver job 214320.423 to queue "all.q@n72"
>>>
>>> And nothing on the node? Full or write-protected spooling directory? Is it by accident on a shared file space or local on each machine?
>>>
>>> -- Reuti
>>>
>>>> On Tue, Oct 14, 2014 at 12:56 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>> Am 14.10.2014 um 18:30 schrieb patrick:
>>>>>
>>>>>> No, it will stay in 't' status and not run. Sometimes after the reboot they will change from 't' to 'r' and sometimes they will stay in 't' until deleted and resubmitted.
>>>>>
>>>>> Aha, that's strange. Anything in the message files of the qmaster or the exechost referring to the <job_id>s in question?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Tue, Oct 14, 2014 at 12:07 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>>>> Hiho,
>>>>>>>
>>>>>>> Am 14.10.2014 um 17:10 schrieb patrick:
>>>>>>>
>>>>>>>> Over the past couple of months we have run into issues with jobs on random nodes staying in a 't' status. The only way to resolve it is to restart the node, which frustrates users who run array and MPI jobs. I am not seeing anything in the logs to indicate an issue. It is using the Berkeley database and I was wondering if that could be causing the issue? As in, does some maintenance need to be done to it to keep it running smoothly?
>>>>>>>
>>>>>>> But the jobs ran fine essentially - so it's more a cosmetic issue?
>>>>>>>
>>>>>>> -- Reuti
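For reference, Reuti's suggestion of node-local spooling in the quoted thread corresponds to the execd_spool_dir parameter in the cluster configuration (editable with `qconf -mconf`); the nfsreduce howto he linked covers the details. A sketch of what a local setting looks like (the path is an example, pick one that exists on every exec host):

```
# qconf -mconf: point execd spooling at a node-local filesystem
# instead of a shared NFS directory
execd_spool_dir   /var/spool/sge
```

Keeping the spool directory local avoids NFS contention and write-permission surprises of exactly the kind Reuti was asking about.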
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users