On Tue, Mar 21, 2017 at 08:41:53AM +0000, juanesteban.jime...@mdc-berlin.de 
wrote:
> The answer to this does not lie in the number of jobs or comparing raw
> performance. Your users probably use completely different tools to generate
> jobs than mine do. Each job submitted can carry with it completely different
> amounts of data in terms of environment variables, scripts, etc.

If there's a way to submit jobs, our users use it. :-/

The "size" of job metadata (scripts, ENV, etc) doesn't really affect
the RAM usage appreciably that I've seen.  We routinely have jobs
ENVs of almost 4k or more, and it's never been a problem.  The
"data" processed by jobs isn't a factor in qmaster RAM usage, so far as
I know.
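
If you're curious how much ENV a job would actually carry, a quick and dirty
check from the submitting shell (only approximate, and assuming you'd use
"qsub -V" to export the whole environment) is:

    # rough size in bytes of the environment "qsub -V" would ship with a job
    env | wc -c

    # number of environment variables
    env | wc -l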

One thing I'm not sure about is submitting large, static binaries as
jobs (e.g. "qsub -b y /path/to/binary").  Since a copy of the binary is
pushed around within SGE (just like a job script is also copied), I
wonder if that could have an impact, but this also seems unlikely.
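
For reference, the two submission styles look like this (paths are just
placeholders):

    # script job: the script itself gets spooled by SGE
    qsub /path/to/job_script.sh

    # binary job: the named binary is executed directly, no wrapper script
    qsub -b y /path/to/binary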

> We are using SGE 8.1.8 with classic spooling. That last one is probably a
> contributor to the issue we just had, but I started working with SGE just 6
> months ago, so I am still learning the options, mostly discovering how to tune
> things after the outage. :(

I don't know that you can convert spooling methods on a "live" system
with running and pending jobs.  You might be able to drain the system of
all jobs, shut down SGE, alter the configuration to use BDB spooling,
then start it up again.
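
Roughly, the drain might look something like this (just a sketch, untested;
adjust queue names and cell settings for your site):

    # disable all queues so nothing new gets scheduled while running jobs finish
    qmod -d '*'

    # watch until the job list empties out
    qstat -u '*'

    # once idle, shut down the qmaster
    qconf -km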

You probably should take a look at the sge_bootstrap(5) manpage if you
go this route.
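
The relevant settings live in $SGE_ROOT/$SGE_CELL/common/bootstrap.  From
memory (so double-check against the manpage), a classic-spooling cell has
entries along these lines, with site-specific paths:

    spooling_method     classic
    spooling_lib        libspoolc
    spooling_params     /opt/sge/default/common;/opt/sge/default/spool/qmaster

while a BDB-spooling cell points spooling_params at the database directory:

    spooling_method     berkeleydb
    spooling_lib        libspoolb
    spooling_params     /opt/sge/default/spool/spooldb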

It's probably simpler to do a fresh install to a new directory with BDB
spooling, and import the old configuration.

BDB spooling can be "faster" on large clusters; it doesn't make much
difference on small ones.  Additionally, if you want to use "shadow
masters" for failover, the BDB files have to be on NFS4 shares (NFS3
will ensure a corrupt spooling database...).  Shadow masters can use
NFS3 if you use classic spooling.
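
On Linux you can check which version an NFS mount is actually using (wherever
your BDB directory would live; there's nothing SGE-specific about this):

    # shows mount options for each NFS mount, including "vers="
    nfsstat -m

    # or read the mount table directly
    grep nfs /proc/mounts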

All that said, I'd still look at a possible memory leak instead of a
problem with the spooling method.
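
A simple way to tell a leak from normal growth is to log the qmaster's RSS for
a day or two and see whether it keeps climbing while the job count stays
roughly flat.  A crude sketch (assumes pgrep/ps from procps):

    # append a timestamp, qmaster RSS in kB, and the qstat line count every 5 min
    while true; do
        echo "$(date +%s) $(ps -o rss= -p "$(pgrep -x sge_qmaster)") $(qstat -u '*' | wc -l)" >> qmaster_rss.log
        sleep 300
    done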

You might want to look at running SoGE, which has a more recent
codebase and may already fix whatever memory leak you are hitting:
   https://arc.liv.ac.uk/trac/SGE/wiki


Also, if you are just starting out, take a look at these slides from
BioTeam.  They are a wonderful resource:
   https://bioteam.net/2009/09/sge-training-slides/
   https://bioteam.net/2011/03/grid-engine-for-users/ (this too)

This is also an excellent presentation, once you've gotten past the
"learning" stage:
   http://beowulf.rutgers.edu/info-user/pdf/ge_presentation.pdf


> Mfg,
> Juan Jimenez
> System Administrator, HPC
> MDC Berlin / IT-Dept.
> Tel.: +49 30 9406 2800


________________________________________
From: Jesse Becker [becke...@mail.nih.gov]
Sent: Monday, March 20, 2017 22:08
To: Jimenez, Juan Esteban
Cc: SGE-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] Sizing the qmaster

On Mon, Mar 20, 2017 at 08:39:38PM +0000, juanesteban.jime...@mdc-berlin.de 
wrote:
> Hi folks,
>
> I just ran into my first episode of the scheduler crashing because of too many
> submitted jobs. It pegged memory usage to as much as I could give it (12 GB at
> one point) and still crashed while trying to work its way through the stack.

How many is "too many?"  We routinely have 50,000+ jobs, and there's
nary a blip in RAM usage on the qmaster.  I'm not even sure that the 
sge_qmaster process uses a Gig of RAM...

Just checked, with 3,000+ jobs in the queue, it's got 550MB RSS, and
a total of 2.3G of virtual memory (including a large mmap of
/usr/lib/locale/locale-archive).
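
(Something like the following will show the same numbers; the exact flags are
a matter of taste:)

    # resident and virtual size of the qmaster process
    ps -C sge_qmaster -o pid,rss,vsz,cmd

    # per-mapping breakdown, which is where locale-archive shows up
    pmap -x $(pgrep -x sge_qmaster)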


> I need to figure out how to size a box properly for a dedicated sge_master. How
> do you folks recommend I do this?

12G should be plenty, IME.  What version are you running, and what
spooling method are you using?
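
If you're not sure what you have, the spooling method is recorded in the
cell's bootstrap file:

    grep spooling $SGE_ROOT/$SGE_CELL/common/bootstrap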




--
Jesse Becker (Contractor)
_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
