Hi,

> On 23.08.2017 at 13:02, Ondrej Valousek <ondrej.valou...@s3group.com> wrote:
> 
> Hi List,
> 
> When running qstat, I sometimes receive messages like:
> "ERROR: failed receiving gdi request response for mid=1 (got syncron message 
> receive timeout error)".
> 
> Also, qping -info shows a warning/error and a high number of qmaster clients (> 
> 40) at the times when I receive messages like the one above.
> So it seems to me that qmaster is not able to handle a higher number of clients 
> for some reason.
> 
> I can think of two possible reasons:
> 
> 1.    Buggy jsv script (but jsv should not be executed when running just 
> 'qstat' right?)

Correct.


> 2.    Qmaster spool directory stored on shared NFS storage

Yes, it would be better to have it local on the node where the qmaster is 
running (unless you want a redundant setup of two qmasters, in which case it 
has to be on a shared device, of course).


> Could someone tell me more about this? Has anyone experienced a similar issue? 
> It seems to me that qmaster should handle ~100 clients without any substantial 
> problem (at least the machine's CPU load is minimal).

If your clients are calling `qstat` that often, it might be good to throttle the 
number of `qstat` invocations. If they need this to start other jobs, one 
could look into using the job_id/job_name to start the next one (a job 
dependency, e.g. `qsub -hold_jid`), or use `inotify` (Linux) or `qevent` (SGE) 
to reduce the polling load.
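
To illustrate the throttling idea, here is a minimal sketch in Python, not a 
tested production tool. It assumes the clients call `qstat` through a small 
wrapper of their own; the names `ThrottledQstat`, `run_qstat` and `qsub_after` 
and the 30 second interval are made up for this example. Repeated calls with 
the same arguments inside the interval are answered from a cache instead of 
contacting the qmaster again, and `qsub_after` only builds a `qsub -hold_jid` 
command line so that a follow-up job waits for its predecessor without polling.

```python
import subprocess
import time
from typing import Callable, Dict, List, Tuple


def run_qstat(args: List[str]) -> str:
    """Run the real qstat once and return its standard output."""
    result = subprocess.run(
        ["qstat", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


class ThrottledQstat:
    """Answer repeated qstat calls from a cache for `min_interval` seconds.

    Hypothetical helper: `runner` and `clock` are injectable so the
    behaviour can be checked without a running qmaster.
    """

    def __init__(
        self,
        min_interval: float = 30.0,  # assumed value, tune for your site
        runner: Callable[[List[str]], str] = run_qstat,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._min_interval = min_interval
        self._runner = runner
        self._clock = clock
        # Maps the argument tuple to (time of last real call, its output).
        self._cache: Dict[Tuple[str, ...], Tuple[float, str]] = {}

    def __call__(self, *args: str) -> str:
        now = self._clock()
        hit = self._cache.get(args)
        if hit is not None and now - hit[0] < self._min_interval:
            return hit[1]  # still fresh: no request goes to the qmaster
        output = self._runner(list(args))
        self._cache[args] = (now, output)
        return output


def qsub_after(script: str, predecessor: str) -> List[str]:
    """Build a qsub command that is held until `predecessor` has finished.

    `predecessor` may be a job id or a job name.
    """
    return ["qsub", "-hold_jid", predecessor, script]
```

Note that this cache only lives inside one process. If many independent 
scripts poll, they would need a shared cache (for example a file refreshed by 
a single cron job), which is left out here.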

-- Reuti
_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
