Re: [gridengine users] Dilemma with exec node reponsiveness degrading

Reuti Fri, 18 Jan 2019 08:28:15 -0800

> On 18.01.2019 at 16:26, Derek Stephenson 
> <derek.stephen...@awaveip.com> wrote:
> 
> Hi Reuti,
> 
> I don't believe anyone has adjusted the scheduler from defaults but I see:
> schedule_interval                 00:00:04
> flush_submit_sec                  1
> flush_finish_sec                  1


With a schedule interval of 4 seconds, I would set the flush values to zero to 
avoid putting too high a load on the qmaster. But this shouldn't be related to 
the behavior you observe. Are you running jobs with a runtime of only a few 
seconds? If not, even a larger schedule interval would do.
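
To compare the three settings at a glance, a small filter over the scheduler 
configuration could look like the sketch below. The function name 
show_sched_timing is made up for illustration, and the sketch assumes the 
parameter names appear at the start of a line, as in the output quoted above.

```shell
# Print only the scheduler timing settings discussed in this thread.
# Reads `qconf -ssconf` style output on stdin, so it also works on saved output.
# show_sched_timing is a hypothetical helper name, not part of SGE.
show_sched_timing() {
    awk '$1 == "schedule_interval" ||
         $1 == "flush_submit_sec"  ||
         $1 == "flush_finish_sec"  { print $1, $2 }'
}

# On a live cluster (assumes qconf is in PATH):
#   qconf -ssconf | show_sched_timing
# To change the values, `qconf -msconf` opens the scheduler
# configuration in an editor.
```

With the values you posted, this would print the 4-second interval followed by 
the two flush settings of 1.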


> For the qlogin side, I've confirmed that there is no firewall. Previously, 
> a reboot alleviated all the issues we were seeing for at least some time, 
> though that time seems to be getting shorter... we had to reboot the server 
> 3 weeks ago for the same issue.

Was there anything else running on the node – inside or outside SGE?

Were any processes left behind by a former interactive session?

What is the value of:

$ qconf -sconf
…
gid_range                    20000-20100

and how many cores are available per node?
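
The reason for asking, as far as I recall from the sge_conf documentation: each 
job running on a host takes one additional group ID out of gid_range, so the 
size of the range caps the number of jobs that can run on a host at the same 
time. A rough sketch to count the IDs in the range follows; the function name 
gid_range_size is made up, and treating a lone number as a single ID is an 
assumption.

```shell
# Count how many group IDs the gid_range setting provides.
# Reads `qconf -sconf` style output on stdin.
# gid_range_size is a hypothetical helper name, not part of SGE.
gid_range_size() {
    awk '$1 == "gid_range" {
        total = 0
        n = split($2, parts, ",")          # comma-separated list of ranges
        for (i = 1; i <= n; i++) {
            if (split(parts[i], b, "-") == 2)
                total += b[2] - b[1] + 1   # m-n covers both endpoints
            else
                total += 1                 # assumption: lone number = one ID
        }
        print total
    }'
}

# On a live cluster (assumes qconf is in PATH):
#   qconf -sconf | gid_range_size
# Compare the result with the core count, e.g. `nproc` run on the exec node.
```

For the example range 20000-20100 shown above, this gives 101 IDs.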

-- Reuti


> Regards,
> 
> Derek
> -----Original Message-----
> From: Reuti <re...@staff.uni-marburg.de> 
> Sent: January 18, 2019 4:51 AM
> To: Derek Stephenson <derek.stephen...@awaveip.com>
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Dilemma with exec node reponsiveness degrading
> 
> 
>> On 18.01.2019 at 03:57, Derek Stephenson 
>> <derek.stephen...@awaveip.com> wrote:
>> 
>> Hello,
>> 
>> I should preface this by saying that I've only recently started getting my 
>> head around grid engine, and as such may not have all the information I 
>> should for administering the grid, but someone has to do it. Anyway...
>> 
>> Our company came across an issue recently where one of the nodes seems to 
>> become very slow to respond to grid submissions. Whether it is a qsub, 
>> qrsh or qlogin submission, jobs can take anywhere from 30 s to 4-5 min to 
>> submit successfully. In particular, while users may complain that a qsub 
>> job looks like it has been submitted but does nothing, doing a qlogin to 
>> the node in question gives the following:
> 
> At least for `qsub` jobs, this might depend on when the job was submitted 
> within the defined scheduling interval. What is the setting of:
> 
> $ qconf -ssconf
> ...
> schedule_interval                 0:2:0
> ...
> flush_submit_sec                  4
> flush_finish_sec                  4
> 
> 
>> Your job 287104 ("QLOGIN") has been submitted
>> waiting for interactive job to be scheduled ...timeout (3 s) expired while waiting on socket fd 7
> 
> For interactive jobs: is there any firewall in place that blocks the 
> communication between the submission host and the exechost, maybe one that 
> was switched on at a later point in time? SGE will use a random port for the 
> communication. After the reboot, did it work instantly again?
> 
> -- Reuti
> 
> 
>> Now, I've seen a series of forum articles bring up this message while 
>> searching through back logs, but there never seems to be any conclusion in 
>> those threads for me to start delving into on our end. 
>> 
>> So far, the only thing that has resolved the issue is rebooting the node in 
>> question, and not having any real idea why it happens is becoming a general 
>> frustration.  
>> 
>> Any initial thoughts/pointers would be greatly appreciated.
>> 
>> Kind Regards,
>> 
>> Derek
>> 
>> _______________________________________________
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
> 
> 

