On Tue, 27 Jun 2017, juanesteban.jime...@mdc-berlin.de wrote:
> Never mind. One of my users submitted a job with 139k subjobs.
> ...
Hi,
I don't think I have all the messages from this thread for some reason. No
doubt I'm going to repeat things someone else has suggested - apologies in
advance :)
Firstly, make sure you've optimised your disk I/O. This typically means
making sure $SGE_ROOT is on a filesystem local to your qmaster, and
reducing NFS traffic from your compute nodes by making their spools local
to each node. They end up in $SGE_ROOT/$SGE_CELL/spool/<HOSTNAME> by
default, but you can choose somewhere else at install time; on an
existing install you can move each spool directory to local disk and
replace it with a symlink. The messages file in there is still useful to
keep central (again via a symlink, and its NFS traffic doesn't seem to
slow things down). People seem to get good results doing this and
sticking with classic spooling. Certainly I do :)
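For an existing install, the move-and-symlink step looks roughly like
this. It's only a sketch: node01 and /var/spool/sge are made-up names,
so adjust for your site, and do it with the node's execd stopped.

  # run on node01 with its sge_execd stopped; node01 and
  # /var/spool/sge are hypothetical names - adjust for your layout
  mkdir -p /var/spool/sge
  mv $SGE_ROOT/$SGE_CELL/spool/node01 /var/spool/sge/node01
  ln -s /var/spool/sge/node01 $SGE_ROOT/$SGE_CELL/spool/node01
  # keep the messages file central: park the real file on the shared
  # filesystem and point a symlink at it from the now-local spool
  mv /var/spool/sge/node01/messages $SGE_ROOT/$SGE_CELL/messages.node01
  ln -s $SGE_ROOT/$SGE_CELL/messages.node01 /var/spool/sge/node01/messages
  # check the local directory's ownership still matches your SGE
  # admin user before restarting the execd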
Secondly, you talk about 139k subjobs - so this is a task array, right?
That is large. You should find that the qmaster handles its memory better
with large task arrays if you can live with setting schedd_job_info to
false in 'qconf -msconf' - it stops the qmaster from collecting the info
shown under 'scheduling info' in the output of 'qstat -j <jid>'.
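If you'd rather script that change than edit it interactively, qconf can
also load the scheduler configuration back from a file - a sketch:

  # dump the scheduler config, flip the flag, load it back
  qconf -ssconf > /tmp/sconf
  sed -i 's/^schedd_job_info.*/schedd_job_info false/' /tmp/sconf
  qconf -Msconf /tmp/sconf
  # confirm it took
  qconf -ssconf | grep schedd_job_info

Plain 'qconf -msconf' in an editor gets you to the same place, of course.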
Hope this helps,
Mark
_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss