> On 18.01.2019 at 18:03, Derek Stephenson <derek.stephen...@awaveip.com> wrote:
>
> There are 32 cores on the machine and its use is split between interactive
> and non-interactive jobs. This mix is similar on other nodes as well, where
> we don't experience this issue. The split is done because our interactive
> jobs tend to be memory intensive but CPU light, while the non-interactive
> ones tend to be CPU heavy and memory light. So there are other processes
> running on the node that are inside SGE, but only root-related system
> processes are running outside of SGE.
>
> I did find a few processes that were left behind, but cleaning those out has
> no impact.
>
> The gid_range is the default:
> gid_range 20000-20100
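As a quick check for such leftovers (a rough sketch, assuming a Linux exec host and the default 20000-20100 range quoted above), processes that still carry one of SGE's additional group IDs can be listed by scanning the supplementary groups in /proc; anything shown that does not belong to a currently running job is a candidate for cleanup:

$ # list PID status files (and the matching GID) whose supplementary groups fall inside gid_range
$ awk '/^Groups:/ { for (i = 2; i <= NF; i++) if ($i >= 20000 && $i <= 20100) print FILENAME, $i }' /proc/[0-9]*/status 2>/dev/null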
This is fine. I thought that SGE might be waiting for a free GID to start the
new job. Is there anything left behind in memory, e.g. shared memory segments
listed by `ipcs`, so that the node starts to swap?

-- Reuti


> Regards,
>
> Derek
> -----Original Message-----
> From: Reuti <re...@staff.uni-marburg.de>
> Sent: January 18, 2019 11:26 AM
> To: Derek Stephenson <derek.stephen...@awaveip.com>
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Dilemma with exec node responsiveness degrading
>
>
>> On 18.01.2019 at 16:26, Derek Stephenson <derek.stephen...@awaveip.com> wrote:
>>
>> Hi Reuti,
>>
>> I don't believe anyone has adjusted the scheduler from the defaults, but I see:
>> schedule_interval 00:00:04
>> flush_submit_sec  1
>> flush_finish_sec  1
>
> With a schedule interval of 4 seconds I would set the flush values to zero to
> avoid too high a load on the qmaster. But this shouldn't be related to the
> behavior you observe. Are you running jobs with a runtime of only a few
> seconds? Otherwise even a larger schedule interval would do.
>
>
>> For the qlogin side, I've confirmed that there is no firewall, and
>> previously a reboot alleviated all the issues we were seeing, at least for
>> some time, though the duration seems to be getting smaller... we had to
>> reboot the server 3 weeks ago for the same issue.
>
> Was there anything else running on the node – inside or outside SGE?
>
> Were any processes left behind by a former interactive session?
>
> What is the value of:
>
> $ qconf -sconf
> …
> gid_range 20000-20100
>
> and how many cores are available per node?
>
> -- Reuti
>
>
>> Regards,
>>
>> Derek
>> -----Original Message-----
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: January 18, 2019 4:51 AM
>> To: Derek Stephenson <derek.stephen...@awaveip.com>
>> Cc: users@gridengine.org
>> Subject: Re: [gridengine users] Dilemma with exec node responsiveness degrading
>>
>>
>>> On 18.01.2019 at 03:57, Derek Stephenson <derek.stephen...@awaveip.com> wrote:
>>>
>>> Hello,
>>>
>>> I should preface this by saying that I've only recently started getting my
>>> head around Grid Engine and as such may not have all the information I
>>> should for administering the grid, but someone has to do it. Anyway...
>>>
>>> Our company came across an issue recently where one of the nodes becomes
>>> very delayed in its response to grid submissions. Whether it is a qsub,
>>> qrsh or qlogin submission, jobs can take anywhere from 30 s to 4-5 min to
>>> submit successfully. In particular, while users may complain that a qsub
>>> job looks like it has been submitted but does nothing, doing a qlogin to
>>> the node in question gives the following:
>>
>> This might, at least for `qsub` jobs, depend on when the job was submitted
>> inside the defined scheduling interval. What is the setting of:
>>
>> $ qconf -ssconf
>> ...
>> schedule_interval 0:2:0
>> ...
>> flush_submit_sec  4
>> flush_finish_sec  4
>>
>>
>>> Your job 287104 ("QLOGIN") has been submitted
>>> waiting for interactive job to be scheduled ...timeout (3 s) expired while waiting on socket fd 7
>>
>> For interactive jobs: is any firewall in place, blocking the communication
>> between the submission host and the exechost - maybe switched on at a later
>> point in time? SGE will use a random port for the communication. After the
>> reboot it worked instantly again?
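One way to rule out a blocked port between the two hosts (a minimal sketch, assuming the common defaults of 6444 for sge_qmaster and 6445 for sge_execd; the actual values can be taken from $SGE_QMASTER_PORT / $SGE_EXECD_PORT or /etc/services, and <exechost> / <qmaster_host> are placeholders) is to probe the daemons with qping from both directions:

$ qping -info <exechost> 6445 execd 1          # run on the submission host
$ qping -info <qmaster_host> 6444 qmaster 1    # run on the exec node

If both respond but interactive jobs still time out, the random port mentioned above for the qlogin connection is the next thing to look at.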
>>
>> -- Reuti
>>
>>
>>> Now I've seen a series of forum articles bring up this message while
>>> searching through back logs, but there never seem to be any conclusions in
>>> those threads for me to start delving into on our end.
>>>
>>> Our past attempts to resolve the issue have only succeeded by rebooting the
>>> node in question, and not having any real idea why is becoming a general
>>> frustration.
>>>
>>> Any initial thoughts/pointers would be greatly appreciated.
>>>
>>> Kind Regards,
>>>
>>> Derek
>>
>>
>
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users