On Thu, Mar 09, 2017 at 05:20:37PM +0100, Jerome Poitout wrote:
> Hello,
> 
> OGS/GE 2011.11p1
> 
> I have an issue when submitting many jobs in a short time (over 300,
> which is not that many for me...) with the -sync y option. It seems that
> qmaster cannot handle all the requests: I get a huge load on the head
> server (>400) and memory gets almost full (32GB).
> 
> These jobs are run by a third party product that does not support job
> arrays (as far as we currently know).
> 
> Then I get some timeout while trying to qstat something...
> 
> [root@ ~]# qstat -u user
> error: failed receiving gdi request response for mid=1 (got syncron
> message receive timeout error).

You could try raising the gdi_timeout and gdi_retries settings in 
qmaster_params (part of the global configuration, edited with qconf -mconf) 
to see if that helps, though it depends on where the timeout is happening.
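
As a rough sketch of what that could look like: the values below are 
illustrative assumptions only, not recommendations, and are worth checking 
against sge_conf(5) for your version before touching a production qmaster.

```shell
# Show the current setting (read-only, safe to run on a live qmaster).
qconf -sconf global | grep qmaster_params

# To change it, edit the global configuration with "qconf -mconf global"
# and set a line such as the following (values are illustrative only):
#   qmaster_params   gdi_timeout=120,gdi_retries=3
```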

> 
> Any idea on how to raise the number of jobs that can be submitted with
> qsub in a short time? I am almost sure that a qmaster parameter can be
> used, but as I am in a production environment, I prefer to be careful...

Where is the qmaster relative to the job spool's backing storage?  In theory 
the qmaster can access the job spool over NFS, but in practice that can slow 
things down enough to cause timeouts like the one above.  I like to keep the 
qmaster on the same machine as the disks which host the job spool.
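
One way to check where the spool lives is to read qmaster_spool_dir from the 
cell's bootstrap file and see what it is mounted on.  This is a minimal 
sketch: bootstrap_value is a hypothetical helper name, and it assumes the 
usual "key value" layout of $SGE_ROOT/$SGE_CELL/common/bootstrap.  With 
Berkeley DB spooling the job data sits under the directory given by 
spooling_params instead, so that key may be the one to look up.

```shell
# Print the value of one key from an SGE bootstrap file.
# Usage: bootstrap_value KEY FILE   (hypothetical helper, not part of SGE)
bootstrap_value() {
    awk -v key="$1" '$1 == key { print $2 }' "$2"
}

# Example (commented out; needs a real installation):
#   dir=$(bootstrap_value qmaster_spool_dir "$SGE_ROOT/$SGE_CELL/common/bootstrap")
#   df -P "$dir"   # an NFS mount shows server:/export in the first column
```

If df reports a server:/path filesystem for that directory, the spool is 
being reached over NFS.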

William


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users