We figured out the issue. Somehow the hosts file on the head nodes had the
Ethernet interfaces removed, so only the InfiniBand interfaces were listed.
This was causing communication problems in SGE. I added the Ethernet
interfaces back to the hosts file a month ago and the problem hasn't come
back. Just an FYI in case anyone else runs into this.
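In other words, the fix amounts to making sure each head node's /etc/hosts lists both interfaces for every host, so hostname resolution for SGE isn't bound to IB-only names. A minimal sketch of what the repaired file might look like (the hostnames, addresses, and the `-ib` suffix convention here are hypothetical, not taken from this thread):

```
# /etc/hosts on a head node -- example addresses/names for illustration only
# Ethernet interfaces (the names the SGE daemons should resolve)
192.168.1.10   hn1
192.168.1.72   n72
# InfiniBand interfaces (kept for MPI traffic, under distinct names)
10.10.1.10     hn1-ib
10.10.1.72     n72-ib
```

If only the IB lines are present, hosts can resolve to addresses on a fabric the qmaster doesn't communicate over, which is consistent with the job-delivery failures quoted below.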

On Tue, Oct 14, 2014 at 3:50 PM, Reuti <re...@staff.uni-marburg.de> wrote:

> Am 14.10.2014 um 19:41 schrieb patrick:
>
> > It is on a shared file space with RW to root.
>
> It's better to have the spool directory local on each node:
>
> https://arc.liv.ac.uk/SGE/howto/nfsreduce.html
>
> -- Reuti
>
>
> > Get this message after the node was rebooted in the nodes message file.
> >
> >
> > 10/14/2014 11:28:16|  main|n72|E|removing unreferenced job 214320.247
> without job report from ptf
> > 10/14/2014 11:28:53|  main|n72|W|reaping job "214320" ptf complains: Job
> does not exist
> >
> > On Tue, Oct 14, 2014 at 1:15 PM, Reuti <re...@staff.uni-marburg.de>
> wrote:
> > Am 14.10.2014 um 19:13 schrieb patrick:
> >
> > > Just in the qmaster messages. It will give an error such as:
> > >
> > > 10/14/2014 11:05:07| timer|hn1|W|failed to deliver job 214320.423 to
> queue "all.q@n72"
> >
> > And nothing on the node? Full or write protected spooling directory? Is
> it by accident on a shared file space or local on each machine?
> >
> > -- Reuti
> >
> >
> > > On Tue, Oct 14, 2014 at 12:56 PM, Reuti <re...@staff.uni-marburg.de>
> wrote:
> > > Am 14.10.2014 um 18:30 schrieb patrick:
> > >
> > > > No, they will stay in 't' status and not run. Sometimes after the
> reboot they will change from 't' to 'r', and sometimes they will stay in 't'
> until deleted and resubmitted.
> > >
> > > Aha, that's strange. Anything in the message files of the qmaster or
> the exechost referring to the <job_id>s in question?
> > >
> > > -- Reuti
> > >
> > >
> > > > Thanks!
> > > >
> > > > On Tue, Oct 14, 2014 at 12:07 PM, Reuti <re...@staff.uni-marburg.de>
> wrote:
> > > > Hiho,
> > > >
> > > > Am 14.10.2014 um 17:10 schrieb patrick:
> > > >
> > > > > Over the past couple of months we have run into issues with jobs on
> random nodes staying in a 't' status. The only way to resolve it is to
> restart the node, which frustrates users who run array and MPI jobs. I
> am not seeing anything in the logs to indicate an issue. It is using the
> Berkeley DB spooling and I was wondering if that could be causing the issue?
> As in, some maintenance needs to be done to it to keep it running smoothly?
> > > >
> > > > But the jobs ran fine essentially - so it's more a cosmetic issue?
> > > >
> > > > -- Reuti
> > > >
> > >
> > >
> >
> >
>
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users