Bill,

On 12/9/13, 5:38 PM, Bill Davidson wrote:
> Last week, one of my servers got an OutOfMemoryError at
> approximately 1:21pm.

:(

It's worth pointing out that this is not a trivial issue.

> My monitoring software which does a heart beat check once per
> minute did not notice until 3:01pm.  Heart beat kept working for
> over an hour and a half.

Was it a transient error, or a chronic condition? A single thread can,
for instance, spew objects into eden space until memory is exhausted
but, when that thread hits the OOME and unwinds, all those objects
become unreachable and can be collected, which basically recovers from
the situation.

If, instead, you fill up some shared cache, buffer, etc. and NO
threads can get more memory, then you're basically toast.

Which of the above was it?

> During that time my high capacity high availability 24/7 application
> was getting occasional OutOfMemoryError's until memory got bad
> enough that even the heart beat check servlet failed.  Apparently
> some things that allocate large chunks of memory started failing
> first, but none of my customers called to complain.  Smaller stuff
> continued to work.  I didn't know until my monitoring software
> sent me an email about the heart beat failure.
> 
> That doesn't work for me.  I need to know sooner.

+1

> I thought of trying to handle it with error-page in web.xml.
> Apparently that does not work.  I used java.lang.Throwable as the
> exception-type. I was already using this for a number of common
> exceptions to send me email.

In most OOME situations, your recovery options are limited... because
the JVM might need to allocate (a small amount of) memory in order to
even report the error.

> I see the OutOfMemoryError's logged in my catalina.out
> 
> Is there some way that I can catch this so that I can send email or
> something?  I need to know as soon as possible so that I can 
> attempt diagnosis and restart the server.  Google has not been
> helpful. Everything says that you have to fix the memory leak.
> Duh.  I know that. We've fixed many over the years.  We haven't had
> one in nearly 2 years. We thought we'd fixed them all.  We need to
> find out about them sooner when they do happen.

There are a bunch of things you can try to do. They all have their
caveats, failure scenarios, and inefficacies.

1. Use -XX:OnOutOfMemoryError="cmd args;cmd args"

Rig this to email you, register a passive-check data point with your
monitoring server, etc. Just remember that OOMEs happen for a number
of reasons. You could have run out of file handles or you could have
run out of heap space.

2. Use JMX monitoring: set java.lang:MemoryPool/[heap
space]/UsageThreshold to whatever byte value you want to use as your
limit. Then, check java.lang:MemoryPool/[heap
space]/UsageThresholdExceeded to see if it is true. If so, your usage
threshold has been exceeded.

Note that this is not proof positive that an OOME occurred. It's also
tough to tell what value to use for the threshold. You can't really
set it to MaxHeap - 1 byte, because you'll never get that value in
practice. If you set it too low, you'll get warnings all the time when
your heap usage rises in the normal course of business.
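
A rough sketch of #2, done in-process through the platform MXBeans
rather than from a remote JMX client. The class name, the method names
and the idea of expressing the threshold as a fraction of each pool's
maximum are all assumptions made for illustration; a real monitor
would more likely read the same attributes over a JMX connection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Tomcat or the JDK.
final class PoolThresholdCheck {

    /**
     * Sets UsageThreshold to (fraction * max) on every heap pool that
     * supports a usage threshold and has a defined maximum.
     * Returns the number of pools that were configured.
     */
    static int armThresholds(double fraction) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        int armed = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP || !pool.isUsageThresholdSupported()) {
                continue; // not every pool supports a usage threshold
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null || usage.getMax() <= 0) {
                continue; // pool is invalid or its maximum is undefined
            }
            pool.setUsageThreshold((long) (usage.getMax() * fraction));
            armed++;
        }
        return armed;
    }

    /** True if any heap pool currently reports UsageThresholdExceeded. */
    static boolean anyThresholdExceeded() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP
                    && pool.isUsageThresholdSupported()
                    && pool.isUsageThresholdExceeded()) {
                return true;
            }
        }
        return false;
    }
}
```

Call armThresholds(0.9) once at startup, then poll
anyThresholdExceeded() from whatever the monitor hits. The 0.9 is only
a placeholder; picking the right value has the problems described
above.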

3. Catch OutOfMemoryError in a filter and set an application attribute.
Check this attribute from your monitor.

I've been considering doing this, because I can rig it so that the
error handler does not actually require any memory to run. The problem
is that sometimes OOMEs interrupt one thread and not another. You may
not catch the OOME in that thread -- it may happen in a background
thread that does not go through the filter.
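
The core of #3 might look something like the sketch below. To keep it
self-contained it leaves out the servlet API: guard() stands in for a
Filter's doFilter(), and the Runnable stands in for the call to
chain.doFilter(). The class name and the use of a static flag instead
of a real application attribute are assumptions; the flag is used
because writing to it allocates nothing.

```java
// Hypothetical helper illustrating the idea, not a real servlet Filter.
final class OomeFlag {

    // Allocated at class-load time, so setting it later needs no memory.
    private static volatile boolean oomeSeen = false;

    /** What the heart beat check (or monitor) would read. */
    static boolean oomeSeen() {
        return oomeSeen;
    }

    /** Stand-in for Filter.doFilter(): runs the request, records any OOME. */
    static void guard(Runnable request) {
        try {
            request.run(); // in a real filter: chain.doFilter(req, res)
        } catch (OutOfMemoryError e) {
            oomeSeen = true; // plain field write, no allocation
            throw e;         // let the container handle it as before
        }
    }
}
```

The heart beat servlet can then fail on purpose whenever
OomeFlag.oomeSeen() returns true. This still only sees OOMEs thrown on
request threads that pass through the filter.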

4. You can do what I do: simply look at your total heap space by
inspecting java.lang:Memory/HeapMemoryUsage["used"] and set a
threshold that will cause your monitor to alarm for WARNING and
CRITICAL conditions. You may recover and not have to do anything.
These days, I get a false-alarm about once every 3 weeks when the heap
space grows a hair higher than usual before a full GC runs and clears
everything out.
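
A minimal sketch of the check in #4, run in-process instead of from a
remote JMX client. The class name, the Level enum and the choice to
express both thresholds as fractions of the maximum heap are
assumptions for illustration; a monitor could just as well compare
"used" against fixed byte values.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of any monitoring product.
final class HeapAlarm {

    enum Level { OK, WARNING, CRITICAL }

    /** Pure classification so the thresholds are easy to reason about. */
    static Level classify(long usedBytes, long maxBytes,
                          double warnFraction, double critFraction) {
        double ratio = (double) usedBytes / maxBytes;
        if (ratio >= critFraction) {
            return Level.CRITICAL;
        }
        if (ratio >= warnFraction) {
            return Level.WARNING;
        }
        return Level.OK;
    }

    /** Reads the same data as java.lang:Memory/HeapMemoryUsage. */
    static Level currentLevel(double warnFraction, double critFraction) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() is -1 when undefined; fall back to committed memory.
        long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return classify(heap.getUsed(), max, warnFraction, critFraction);
    }
}
```

Something like HeapAlarm.currentLevel(0.80, 0.95) could back a status
page that the monitor polls. Those two fractions are placeholders; set
them too low and you get the false alarms described above.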

The nice thing about #4 is that you can find out early if you *might*
be having a problem. Then you can keep an eye on your service to make
sure it "recovers". If it never OOME's, great. If it does, you can
manually restart or whatever. If it OOME's, and #1-#3 above fail
because memory might be required to actually execute the
do-this-thing-on-OOME action, then you might never get notified. With
#4, you don't have to wait until an OOME to take action.

-chris