Hi Joshua,

Thank you for your reply. Please see my comments below.

On Tue, Jul 03, 2018 at 11:30 AM PDT, Joshua Baker-LePain wrote:
> On Tue, 26 Jun 2018 at 9:12am, Mun Johl wrote
>
> > We're using SGE 8.1.9 on CentOS 6.9
> >
> > "All of the sudden" we've noticed that when we reboot an execution host,
> > any jobs sent to it within the first 10-15 min following boot-up will
> > get stuck in the 't' state until deleted (sometimes that has to be done
> > forcibly). However, after 10-ish minutes, the execution host will start
> > accepting jobs.
> >
> > In the qmaster's messages file, I see the following entries:
> >
> > 06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique
> > error (endpoint "sim4.work.com/execd/1" is already connected)
> > 06/25/2018 10:38:36| timer|sim1|W|failed to deliver job 54312.1 to queue
> > "shor...@sim4.work.com"
> > 06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target
> > "execd" on host "sim4.work.com", can't deliver job "54312"
>
> One possibility occurs to me. SoGE 8.1.9 has a bug where "qconf -s"
> commands fail on non-admin hosts (see
> <https://arc.liv.ac.uk/trac/SGE/ticket/1576>). One side-effect of this is
> that the init script fails to properly shutdown the execd. I'm wondering
> if that's leading to your problem. I don't see this, but I'm running on
> CentOS-7, which may lead to some different behavior.

Thanks for the suggestion but I don't believe that issue is the root
cause of my problems. I don't see the same error and the host that
experienced the error that I posted is also an Administrative host.

Kind regards,

--
Mun