Hi Reuti,

I don't believe anyone has adjusted the scheduler from defaults but I see:

schedule_interval                 00:00:04
flush_submit_sec                  1
flush_finish_sec                  1

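For anyone comparing these values against the ones asked about in the quoted reply, a minimal sketch follows. It assumes `qconf -ssconf` prints one `name value` pair per line; the helper names `sched_timing` and `interval_to_seconds` are hypothetical and not part of Grid Engine.

```shell
# Hypothetical helper: keep only the three scheduler timing settings
# discussed in this thread, reading `qconf -ssconf`-style text on stdin.
sched_timing() {
    awk '$1 == "schedule_interval" || $1 == "flush_submit_sec" || $1 == "flush_finish_sec" { print $1, $2 }'
}

# Hypothetical helper: convert an H:M:S interval such as 0:2:0 into seconds,
# so differently padded values (00:00:04 vs 0:2:0) can be compared directly.
interval_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Example usage on a host with Grid Engine installed (not run here):
#   qconf -ssconf | sched_timing
#   interval_to_seconds 00:00:04
```

With the values above, `interval_to_seconds 00:00:04` gives 4 seconds, whereas the `0:2:0` shown in the quoted reply corresponds to 120 seconds.
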
For the qlogin side, I've confirmed that there is no firewall, and previously a reboot alleviated all the issues we were seeing for at least some time, though the duration seems to be getting shorter... we had to reboot the server 3 weeks ago for the same issue.

Regards,
Derek

-----Original Message-----
From: Reuti <re...@staff.uni-marburg.de>
Sent: January 18, 2019 4:51 AM
To: Derek Stephenson <derek.stephen...@awaveip.com>
Cc: users@gridengine.org
Subject: Re: [gridengine users] Dilemma with exec node reponsiveness degrading

> On 18.01.2019 at 03:57, Derek Stephenson <derek.stephen...@awaveip.com> wrote:
>
> Hello,
>
> I should preface this by saying I've just recently started getting my head around
> grid engine and as such may not have all the information I should for
> administering the grid, but someone has to do it. Anyways...
>
> Our company came across an issue recently where one of the nodes seems to become
> very delayed in its response to grid submissions. Whether it be a qsub, qrsh
> or qlogin submission, jobs can take anywhere from 30s to 4-5min to
> successfully submit. In particular, while users may complain that a qsub job looks
> like it has submitted but does nothing, doing a qlogin to the node in question
> will give the following:

This might, at least for `qsub` jobs, depend on when the job was submitted inside the defined scheduling interval. What is the setting of:

$ qconf -ssconf
...
schedule_interval                 0:2:0
...
flush_submit_sec                  4
flush_finish_sec                  4

> Your job 287104 ("QLOGIN") has been submitted
> waiting for interactive job to be scheduled ...timeout (3 s) expired while waiting on socket fd 7

For interactive jobs: is any firewall in place, blocking the communication between the submission host and the exechost - maybe switched on at a later point in time? SGE will use a random port for the communication.

After the reboot it worked instantly again?

-- Reuti

> Now I've seen a series of forum articles bring up this message while
> searching through back logs, but there never seem to be any conclusions in
> those threads for me to start delving into on our end.
>
> Our past attempts to resolve the issue have only succeeded by rebooting the
> node in question, and not having any real idea why is becoming a general
> frustration.
>
> Any initial thoughts/pointers would be greatly appreciated.
>
> Kind Regards,
>
> Derek
>
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users