Hi Reuti,

I don't believe anyone has adjusted the scheduler from defaults but I see:
schedule_interval                 00:00:04
flush_submit_sec                  1
flush_finish_sec                  1

For the qlogin side, I've confirmed that there is no firewall. Previously a 
reboot alleviated all the issues we were seeing for at least some time, though 
the time before they return seems to be getting shorter... we had to reboot the 
server 3 weeks ago for the same issue.

Regards,

Derek
-----Original Message-----
From: Reuti <re...@staff.uni-marburg.de> 
Sent: January 18, 2019 4:51 AM
To: Derek Stephenson <derek.stephen...@awaveip.com>
Cc: users@gridengine.org
Subject: Re: [gridengine users] Dilemma with exec node reponsiveness degrading


> On 18.01.2019 at 03:57, Derek Stephenson 
> <derek.stephen...@awaveip.com> wrote:
> 
> Hello,
> 
> I should preface this by saying that I've only recently started getting my 
> head around Grid Engine, so I may not have all the information I should for 
> administering the grid, but someone has to do it. Anyway...
> 
> Our company recently came across an issue where one of the nodes becomes 
> very delayed in its response to grid submissions. Whether it is a qsub, qrsh, 
> or qlogin submission, jobs can take anywhere from 30 s to 4-5 min to 
> submit successfully. In particular, users complain that a qsub job looks 
> like it has been submitted but does nothing, and doing a qlogin to the node 
> in question gives the following:

At least for `qsub` jobs, this might depend on when the job was submitted 
within the defined scheduling interval. What are your settings for:

$ qconf -ssconf
...
schedule_interval                 0:2:0
...
flush_submit_sec                  4
flush_finish_sec                  4
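
To compare these values across hosts or over time, the relevant lines can be pulled out of the `qconf -ssconf` output and converted to seconds. The following is only a minimal sketch: the helper names `hms_to_seconds` and `parse_ssconf` are made up for illustration, and it assumes the output has one "name value" pair per line, with `schedule_interval` in hours:minutes:seconds form as shown above.

```python
def hms_to_seconds(value):
    """Convert an H:M:S string such as '0:2:0' or '00:00:04' to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_ssconf(text):
    """Pick the timing-related settings out of `qconf -ssconf` output.

    Assumes each setting is a single 'name value' line; other lines
    are ignored.
    """
    wanted = {"schedule_interval", "flush_submit_sec", "flush_finish_sec"}
    settings = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0] not in wanted:
            continue
        key, value = parts
        settings[key] = hms_to_seconds(value) if ":" in value else int(value)
    return settings


# Sample text copied from the example output above, not from a live system.
SAMPLE = """\
schedule_interval                 0:2:0
flush_submit_sec                  4
flush_finish_sec                  4
"""

if __name__ == "__main__":
    for key, value in sorted(parse_ssconf(SAMPLE).items()):
        print(f"{key}: {value} s")
```

With a small `schedule_interval` and non-zero flush values, the scheduler itself is unlikely to account for delays of several minutes, so the numbers mainly help to rule it out.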


> Your job 287104 ("QLOGIN") has been submitted
> waiting for interactive job to be scheduled ...timeout (3 s) expired while 
> waiting on socket fd 7

For interactive jobs: is there any firewall in place that blocks the 
communication between the submission host and the exechost, maybe one that was 
switched on at a later point in time? SGE will use a random port for the 
communication. Did it work instantly again after the reboot?
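
One way to test plain TCP reachability between the two hosts, independent of SGE, is to open a listener on an arbitrary port on one machine and try to connect to it from the other. This is only a rough sketch: `can_connect` and `open_listener` are hypothetical helper names, the host and port are whatever you choose for the test, and the 3 second default merely mirrors the timeout in the message above. It says nothing about which direction or port SGE itself uses in your setup.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def open_listener(host="0.0.0.0", port=0):
    """Listen on a TCP port; port=0 lets the OS pick a free one."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(1)
    return server
```

Run `open_listener()` on one host, read the chosen port with `getsockname()[1]`, and call `can_connect()` with that host and port from the other. Repeating the test in both directions, while the node is healthy and again while it is slow, would show whether plain connections are affected at all.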

-- Reuti


> Now, I've seen a series of forum threads bring up this message while 
> searching through the archives, but there never seem to be any conclusions 
> in those threads for me to start delving into on our end. 
> 
> Our past attempts to resolve the issue have only succeeded by rebooting the 
> node in question, and not having any real idea why is becoming a general 
> frustration.  
> 
> Any initial thoughts/pointers would be greatly appreciated.
> 
> Kind Regards,
> 
> Derek
> 
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

