> On 18.01.2019 at 16:26, Derek Stephenson <derek.stephen...@awaveip.com> wrote:
>
> Hi Reuti,
>
> I don't believe anyone has adjusted the scheduler from defaults but I see:
> schedule_interval 00:00:04
> flush_submit_sec 1
> flush_finish_sec 1

With a schedule interval of 4 seconds I would set the flush values to zero to avoid putting too high a load on the qmaster. But this shouldn't be related to the behavior you observe. Are you running jobs with only a few seconds of runtime? Otherwise even a larger schedule interval would do.

> For the qlogin side, I've confirmed that there is no firewall and previously
> a reboot alleviated all issues we were seeing for at least some time, though
> the duration seems to be getting smaller... we had to reboot the server 3
> weeks ago for the same issue.

Was there anything else running on the node – inside or outside SGE? Were any processes left behind by a former interactive session?

What is the value of:

$ qconf -sconf
…
gid_range 20000-20100

and how many cores are available per node?

-- Reuti

> Regards,
>
> Derek
> -----Original Message-----
> From: Reuti <re...@staff.uni-marburg.de>
> Sent: January 18, 2019 4:51 AM
> To: Derek Stephenson <derek.stephen...@awaveip.com>
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Dilemma with exec node reponsiveness degrading
>
>
>> On 18.01.2019 at 03:57, Derek Stephenson <derek.stephen...@awaveip.com> wrote:
>>
>> Hello,
>>
>> I should preface this by saying that I've only recently started getting my
>> head around grid engine and as such may not have all the information I
>> should for administering the grid, but someone has to do it. Anyway...
>>
>> Our company came across an issue recently where one of the nodes seems to
>> become very delayed in its response to grid submissions. Whether it be a
>> qsub, qrsh or qlogin submission, jobs can take anywhere from 30s to 4-5min to
>> successfully submit. In particular, while users may complain that a qsub job
>> looks like it has been submitted but does nothing, doing a qlogin to the
>> node in question will give the following:
>
> This might, at least for `qsub` jobs, depend on when it was submitted inside
> the defined scheduling interval. What is the setting of:
>
> $ qconf -ssconf
> ...
> schedule_interval 0:2:0
> ...
> flush_submit_sec 4
> flush_finish_sec 4
>
>
>> Your job 287104 ("QLOGIN") has been submitted waiting for interactive
>> job to be scheduled ...timeout (3 s) expired while waiting on socket
>> fd 7
>
> For interactive jobs: is any firewall in place, blocking the communication
> between the submission host and the exechost - maybe switched on at a later
> point in time? SGE will use a random port for the communication. After the
> reboot it worked instantly again?
>
> -- Reuti
>
>
>> Now I've seen a series of forum articles bring up this message while
>> searching through back logs, but there never seem to be any conclusions in
>> those threads for me to start delving into on our end.
>>
>> Our past attempts to resolve the issue have only succeeded by rebooting the
>> node in question, and not having any real idea why is becoming a general
>> frustration.
>>
>> Any initial thoughts/pointers would be greatly appreciated.
>>
>> Kind Regards,
>>
>> Derek
>>
>> _______________________________________________
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
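
The scheduler settings compared in this thread (a 4 second `schedule_interval` with flush values of 1, versus a two minute interval with flush values of 4) can be checked mechanically. The following is only a rough sketch: the helper names `to_seconds`, `parse_ssconf` and `flush_is_redundant`, and the 5 second threshold, are hypothetical and not part of Grid Engine. It assumes the settings are plain "name value" lines, as quoted above, and that time values use the h:m:s form shown in the thread.

```python
def to_seconds(value: str) -> int:
    """Convert a time value such as '0:2:0', '00:00:04' or '4' to seconds."""
    seconds = 0
    for part in value.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_ssconf(text: str) -> dict:
    """Collect 'name value' lines into a dict; other lines (e.g. '...') are skipped."""
    settings = {}
    for line in text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2:
            settings[fields[0]] = fields[1].strip()
    return settings


def flush_is_redundant(settings: dict, threshold: int = 5) -> bool:
    """Return True if the interval is already short but flushing is still enabled.

    The threshold (in seconds) is an arbitrary assumption for illustration.
    """
    interval = to_seconds(settings["schedule_interval"])
    flushes = (
        to_seconds(settings.get("flush_submit_sec", "0")),
        to_seconds(settings.get("flush_finish_sec", "0")),
    )
    return interval <= threshold and any(f > 0 for f in flushes)


# Values copied from the message quoted at the top of this thread.
SAMPLE = """schedule_interval 00:00:04
flush_submit_sec 1
flush_finish_sec 1
"""

if __name__ == "__main__":
    print(flush_is_redundant(parse_ssconf(SAMPLE)))
```

On a real installation the text would come from the output of `qconf -ssconf` rather than from the `SAMPLE` string.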