On Tue, Mar 21, 2017 at 08:41:53AM +0000, juanesteban.jime...@mdc-berlin.de
wrote:
The answer to this does not lie in the number of jobs or comparing raw
performance. Your users probably use completely different tools to generate
jobs than mine do. Each job submitted can carry with it completely different
amounts of data in terms of environment variables, scripts, etc.
If there's a way to submit jobs, our users use it. :-/
The "size" of job metadata (scripts, ENV, etc) doesn't really affect
the RAM usage appreciably that I've seen. We routinely have jobs
ENVs of almost 4k or more, and it's never been a problem. The
"data" processed by jobs isn't a factor in qmaster RAM usage, so far as
I know.
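As a rough, out-of-thread sanity check, you can see how many bytes of
environment "qsub -V" would capture from your submitting shell:

```shell
# Print the approximate size, in bytes, of the current environment;
# per the discussion above, a few kB per job is routine and harmless.
env | wc -c
```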
One thing I'm not sure about is submitting large, static binaries as
jobs (e.g. "qsub -b y /path/to/binary"). Since a copy of the binary is
pushed around within SGE (just like a job script is also copied), I
wonder if that could have an impact, but this also seems unlikely.
We are using SGE 8.1.8 with classic spooling. That last one is probably a
contributor to the issue we just had, but I started working with SGE just 6
months ago, so I am still learning the options, mostly discovering how to tune
things after the outage. :(
I don't know that you can convert spooling methods on a "live" system
with running and pending jobs. You might be able to drain the system of
all jobs, shut down SGE, alter the configuration to use BDB spooling,
then start it up again.
You probably should take a look at the sge_bootstrap(5) manpage if you
go this route.
It's probably simpler to do a fresh install to a new directory with BDB
spooling, and import the old configuration.
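A sketch of the drain-and-switch sequence using the standard gridengine
CLI; check the commands against your own installation before relying on
them:

```shell
# Disable all queues so no new jobs start.
qmod -d '*'

# Wait until qstat shows no remaining jobs, then stop the qmaster.
qstat -u '*'
qconf -km

# Now either change the spooling settings per sge_bootstrap(5), or do
# a fresh install with BDB spooling, import the old configuration, and
# restart the qmaster.
```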
BDB spooling can be "faster" on large clusters; it doesn't make much
difference on small ones. Additionally, if you want to use "shadow
masters" for failover, the BDB files have to be on NFSv4 shares (NFSv3
will ensure a corrupt spooling database...). Shadow masters can use
NFSv3 if you use classic spooling.
All that said, I'd still look at a possible memory leak instead of a
problem with the spooling method.
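One way to watch for a leak (a sketch, assuming a Linux qmaster host;
the log path and interval are arbitrary):

```shell
# Log the qmaster's resident set size every 10 minutes; RSS that keeps
# climbing while the job count stays flat suggests a leak, not load.
while sleep 600; do
    printf '%s %s kB\n' "$(date '+%F %T')" "$(ps -C sge_qmaster -o rss=)"
done >> qmaster-rss.log
```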
You might want to look at running SoGE (Son of Grid Engine), which has
a more recent codebase and may fix a memory leak you are hitting:
https://arc.liv.ac.uk/trac/SGE/wiki
Also, if you are just starting out, take a look at these slides from
BioTeam. They are a wonderful resource:
https://bioteam.net/2009/09/sge-training-slides/
https://bioteam.net/2011/03/grid-engine-for-users/ (this too)
This is also an excellent presentation, once you've gotten past the
"learning" stage:
http://beowulf.rutgers.edu/info-user/pdf/ge_presentation.pdf
Mfg,
Juan Jimenez
System Administrator, HPC
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800
________________________________________
From: Jesse Becker [becke...@mail.nih.gov]
Sent: Monday, March 20, 2017 22:08
To: Jimenez, Juan Esteban
Cc: SGE-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] Sizing the qmaster
On Mon, Mar 20, 2017 at 08:39:38PM +0000, juanesteban.jime...@mdc-berlin.de
wrote:
Hi folks,
I just ran into my first episode of the scheduler crashing because of too many
submitted jobs. It pegged memory usage at as much as I could give it (12 GB at
one point) and still crashed while it tried to work its way through the stack.
How many is "too many?" We routinely have 50,000+ jobs, and there's
nary a blip in RAM usage on the qmaster. I'm not even sure that the
sge_qmaster process uses a Gig of RAM...
Just checked, with 3,000+ jobs in the queue, it's got 550MB RSS, and
a total of 2.3G of virtual memory (including a large mmap of
/usr/lib/locale/locale-archive).
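For anyone wanting to reproduce that measurement, a one-liner along
these lines works on the qmaster host (assuming Linux ps):

```shell
# Show PID, resident memory, and virtual size (both in kB) for the
# running qmaster process.
ps -C sge_qmaster -o pid=,rss=,vsz=
```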
I need to figure out how to size a box properly for a dedicated sge_master. How
do you folks recommend I do this?
12G should be plenty, IME. What version are you running, and what
spooling method are you using?
--
Jesse Becker (Contractor)
_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss