On Tue, 27 Jun 2017, juanesteban.jime...@mdc-berlin.de wrote:
> Never mind. One of my users submitted a job with 139k subjobs.
> ...
Hi,
I don't think I have all the messages from this thread for some reason. No
doubt I'm going to repeat things someone else has suggested - apologies in
advance :)
Firstly, make sure you've optimised your disk I/O. This typically means
making sure $SGE_ROOT is on a filesystem local to your qmaster, and
reducing NFS traffic from your compute nodes by making their spools local
to each node. They end up in $SGE_ROOT/$SGE_CELL/spool/<HOSTNAME> by
default, but you can choose somewhere else at install time; on an
existing install you can move each spool directory to local disk and
replace it with a symlink. The messages file in there is still useful to
keep central (again via a symlink, and its NFS traffic doesn't seem to
slow things down). People seem to get good results doing this and
sticking with classic spooling. Certainly I do :)
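For an existing install, the move-and-symlink step looks roughly like
this. It's only a sketch: node01 and /var/spool/sge are made-up names,
so adjust for your site, and do it with the node's execd stopped.

  # run on node01 with its sge_execd stopped; node01 and
  # /var/spool/sge are hypothetical names - adjust for your layout
  mkdir -p /var/spool/sge
  mv $SGE_ROOT/$SGE_CELL/spool/node01 /var/spool/sge/node01
  ln -s /var/spool/sge/node01 $SGE_ROOT/$SGE_CELL/spool/node01
  # keep the messages file central: park the real file on the shared
  # filesystem and point a symlink at it from the now-local spool
  mv /var/spool/sge/node01/messages $SGE_ROOT/$SGE_CELL/messages.node01
  ln -s $SGE_ROOT/$SGE_CELL/messages.node01 /var/spool/sge/node01/messages
  # check the local directory's ownership still matches your SGE
  # admin user before restarting the execd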
Secondly, you talk about 139k subjobs - so this is a task array, right?
That is large. You should find that the qmaster handles its memory better
with large task arrays if you can live with setting schedd_job_info to
false in 'qconf -msconf' - it stops the qmaster from collecting the info
shown under 'scheduling info' in the output of 'qstat -j <jid>'.
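If you'd rather script that change than edit it interactively, qconf can
also load the scheduler configuration back from a file - a sketch:

  # dump the scheduler config, flip the flag, load it back
  qconf -ssconf > /tmp/sconf
  sed -i 's/^schedd_job_info.*/schedd_job_info false/' /tmp/sconf
  qconf -Msconf /tmp/sconf
  # confirm it took
  qconf -ssconf | grep schedd_job_info

Plain 'qconf -msconf' in an editor gets you to the same place, of course.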
Hope this helps,
Mark
_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss