> On 18.01.2019 at 18:03, Derek Stephenson <derek.stephen...@awaveip.com> wrote:
> 
> There are 32 cores on the machine and its use is split between interactive and 
> non-interactive jobs. The mix is similar on other nodes, which don't experience 
> this issue. The split is done because our interactive jobs tend to be memory 
> intensive but CPU light, while the non-interactive ones tend to be CPU heavy 
> and memory light. So there are other processes running on the node, but they 
> are all inside SGE; only root-related system processes are running outside of 
> SGE.
> 
> I did find a few processes that were left behind, but cleaning those out had 
> no impact. 
> 
> The gid_range is the default:
> gid_range                    20000-20100

This is fine. I thought SGE might be waiting for a free GID to start the new job, but the range 20000-20100 provides 101 group IDs per node, which is plenty for a 32-core machine.

Is there anything left behind in memory, e.g. shared memory segments listed by 
`ipcs`, and does the machine start to swap?
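
For example (just a sketch; the segment ID below is a placeholder, and removing 
a segment is only safe once its owning process is confirmed gone):

$ ipcs -m          # list shared memory segments and their owners
$ ipcs -m -t       # attach/detach times help spot stale segments
$ free -h          # is swap already in use?
$ vmstat 1 5       # non-zero si/so columns mean the node is actively swapping
$ ipcrm -m <shmid> # remove a stale segment once its owner is gone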

-- Reuti


> 
> Regards,
> 
> Derek
> -----Original Message-----
> From: Reuti <re...@staff.uni-marburg.de> 
> Sent: January 18, 2019 11:26 AM
> To: Derek Stephenson <derek.stephen...@awaveip.com>
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Dilemma with exec node reponsiveness degrading
> 
> 
>> On 18.01.2019 at 16:26, Derek Stephenson <derek.stephen...@awaveip.com> wrote:
>> 
>> Hi Reuti,
>> 
>> I don't believe anyone has adjusted the scheduler from the defaults, but I see:
>> schedule_interval                 00:00:04
>> flush_submit_sec                  1
>> flush_finish_sec                  1
> 
> With a schedule interval of 4 seconds I would set the flush values to zero to 
> avoid too high a load on the qmaster. But this shouldn't be related to the 
> behavior you observe. Are you running jobs with only a few seconds of runtime? 
> Otherwise even a larger schedule interval would do.
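> 
> A minimal sketch of that change (`qconf -msconf` opens the scheduler 
> configuration in an editor; these are the values I would try):
> 
> $ qconf -msconf
> ...
> flush_submit_sec                  0
> flush_finish_sec                  0
> ...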
> 
> 
>> For the qlogin side, I've confirmed that there is no firewall, and previously 
>> a reboot alleviated all the issues we were seeing, at least for some time, 
>> though that duration seems to be getting shorter... we had to reboot the 
>> server 3 weeks ago for the same issue.
> 
> Was there anything else running on the node – inside or outside SGE?
> 
> Were any processes left behind by a former interactive session?
> 
> What is the value of:
> 
> $ qconf -sconf
> …
> gid_range                    20000-20100
> 
> and how many cores are available per node?
> 
> -- Reuti
> 
> 
>> Regards,
>> 
>> Derek
>> -----Original Message-----
>> From: Reuti <re...@staff.uni-marburg.de>
>> Sent: January 18, 2019 4:51 AM
>> To: Derek Stephenson <derek.stephen...@awaveip.com>
>> Cc: users@gridengine.org
>> Subject: Re: [gridengine users] Dilemma with exec node reponsiveness 
>> degrading
>> 
>> 
>>> On 18.01.2019 at 03:57, Derek Stephenson <derek.stephen...@awaveip.com> wrote:
>>> 
>>> Hello,
>>> 
>>> I should preface this by saying I've just recently started getting my head 
>>> around Grid Engine and, as such, may not have all the information I should 
>>> have for administering the grid, but someone has to do it. Anyway...
>>> 
>>> Our company came across an issue recently where one of the nodes becomes 
>>> very delayed in its response to grid submissions. Whether it is a qsub, qrsh 
>>> or qlogin submission, jobs can take anywhere from 30 seconds to 4-5 minutes 
>>> to successfully submit. In particular, while users may complain that a qsub 
>>> job looks like it has been submitted but does nothing, doing a qlogin to the 
>>> node in question gives the following:
>> 
>> This might, at least for `qsub` jobs, depend on when the job was submitted 
>> inside the defined scheduling interval. What is the setting of:
>> 
>> $ qconf -ssconf
>> ...
>> schedule_interval                 0:2:0
>> ...
>> flush_submit_sec                  4
>> flush_finish_sec                  4
>> 
>> 
>>> Your job 287104 ("QLOGIN") has been submitted
>>> waiting for interactive job to be scheduled ...
>>> timeout (3 s) expired while waiting on socket fd 7
>> 
>> For interactive jobs: is any firewall in place blocking the communication 
>> between the submission host and the exec host - maybe one switched on at a 
>> later point in time? SGE will use a random port for the communication. After 
>> the reboot it worked instantly again?
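>> 
>> As a quick connectivity check between the daemons (a sketch only: 6444 and 
>> 6445 are just the usual default ports, adjust to your $SGE_QMASTER_PORT and 
>> $SGE_EXECD_PORT, and the host names are placeholders):
>> 
>> $ qping <qmaster_host> 6444 qmaster 1
>> $ qping <exec_host> 6445 execd 1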
>> 
>> -- Reuti
>> 
>> 
>>> Now, I've seen a series of forum articles bring up this message while 
>>> searching through back logs, but there never seem to be any conclusions in 
>>> those threads for me to start delving into on our end.
>>> 
>>> Our past attempts to resolve the issue have only succeeded by rebooting the 
>>> node in question, and not having any real idea why is becoming a general 
>>> frustration.
>>> 
>>> Any initial thoughts/pointers would be greatly appreciated.
>>> 
>>> Kind Regards,
>>> 
>>> Derek
>>> 
>> 
>> 
> 
> 


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
