On Thu, Mar 09, 2017 at 05:20:37PM +0100, Jerome Poitout wrote:
> Hello,
>
> OGS/GE 2011.11p1
>
> I have an issue while submitting numerous jobs in a short time (over 300
> - not so much for me...) with -sync y option. It seems that qmaster
> cannot handle all the requests and i get huge load on the head server
> (>400) and memory gets almost full (32GB).
>
> These jobs are run by a third party product that does not support job
> arrays (as far as we currently know).
>
> Then I get some timeout while trying to qstat something...
>
> [root@ ~]# qstat -u user
> error: failed receiving gdi request response for mid=1 (got syncron
> message receive timeout error).

You could try fiddling with the gdi_timeout and gdi_retries settings in
qmaster_params to see if that helps. Whether it does depends on where the
timeout is happening, though.

> Any idea on how to raise the number a jobs that can be qsub in a short
> time ? I am almost sure that a qmaster params can be used but as I am in
> production environment, I prefer to be careful...

Where is the qmaster relative to the job spool's backing storage? While in
theory the qmaster can access the job spool over NFS, in practice that can
slow things down enough to cause timeouts like the one above. I like to keep
the qmaster on the same machine as the disks that host the job spool.

William
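
P.S. For reference, a rough sketch of how the two settings mentioned above could be inspected and changed. The values 120 and 4 are made up for illustration, not recommendations, and I am assuming the usual comma-separated form of qmaster_params; check sge_conf(5) for your version before touching a production qmaster.

```shell
# Show the current global configuration and look for qmaster_params.
qconf -sconf | grep qmaster_params

# Edit the global configuration in $EDITOR.
qconf -mconf

# In the editor, the line would then look something like this
# (hypothetical values; any parameters already set on the line must be kept):
#
#   qmaster_params   gdi_timeout=120,gdi_retries=4
```

Change one value at a time and watch qmaster load afterwards. A longer timeout only hides the symptom if the qmaster is really stalled on slow spool storage.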
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users