> in the meantime I've added RAM and extended the heap to 2GB. But still I'm
> getting crashes of PuppetDB.
> Last time it was the kernel OOM that killed the java process as I saw in
> /var/log/messages
> kernel: Out of memory: Kill process 10146 (java) score 158 or sacrifice
> child
> kernel: Killed process 10146, UID 498, (java) total-vm:3649108kB,
> anon-rss:1821580kB, file-rss:72kB

This kind of crash usually has more to do with the tuning of your Linux
instance than with PuppetDB itself. The OOM killer formula is somewhat
tricky, but as a general rule it takes into account the amount of RAM +
swap (which most people don't expect). So if your swap is zero, or very
low relative to your RAM, you may find the OOM killer killing processes
before your RAM fills up.

The things you want to research are the vm.overcommit_ratio sysctl and
your swap allocation. There are lots of articles online about both.

As a general rule, if you're running with low swap you need
overcommit_ratio to be higher. By default it's set to 50% of the total
virtual memory I believe, so if a process tries to allocate memory once
you've exceeded 50% of your overall RAM + swap, the OOM killer will
kick in.

Here's an example from one of my instances, so you can see how to analyze this:

root@puppetdb1:~# free
             total       used       free     shared    buffers     cached
Mem:       2054120     975464    1078656          0     169772     219876
-/+ buffers/cache:     585816    1468304
Swap:       892924          0     892924
root@puppetdb1:~# cat /proc/sys/vm/overcommit_ratio
50
root@puppetdb1:~#

So in my case, the total virtual memory available is 2.8 GB (2 GB RAM +
roughly 800 MB swap), and if a process tries to allocate more than 50% of
it (about 1.4 GB), the OOM killer might kick in.

I'm obviously trivialising the whole story for brevity; the OOM killer
has a few little quirks that might affect all of this (and quite a few
independent tunables). But more often than not, if you think you have
enough RAM yet the OOM killer is still killing your processes, the cause
is somewhere between your swap allocation and your overcommit_ratio. I've
seen this a bit in virtualised environments, and in places that try to
launch instances with zero swap, for example.
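If you want to experiment with those settings yourself, it's just a
couple of sysctls. A rough sketch (the values here are purely
illustrative, and the persistent config file varies by distro):

# check what you're currently running with (read-only)
sysctl vm.overcommit_memory vm.overcommit_ratio
swapon -s

# raise the ratio on the running kernel (example value only)
sysctl -w vm.overcommit_ratio=80

# make it persistent via /etc/sysctl.conf (or a file under /etc/sysctl.d/)
echo "vm.overcommit_ratio = 80" >> /etc/sysctl.conf

Adding (or enlarging) swap is the other lever, and is often the simpler
fix on VMs that were launched with none.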

> Does this mean I need to add additional heap space and/or RAM?
> I'm looking at the dashboard and it's as if the heap slowly increases. Right
> after startup it's 3 of 400 MB, after a day or so I'm over 1 GB ....

If it's not crashing in the JVM any more, I'd focus on the tuning issue
above. If the Java instance is still crashing, you can try increasing the
heap to see where the stable point is. If that starts to get silly and
you still don't know why, it's possible to analyze the heap dumps that
get left behind for clues, but about 90% of the time I find it's a large
catalog causing it, so I'd focus there first.
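For reference, the heap (and the OOM heap-dump behaviour) is controlled
via JAVA_ARGS in the packaging defaults file. A sketch, assuming the
usual package locations; adjust the values and paths for your install:

# /etc/sysconfig/puppetdb on RHEL-family, /etc/default/puppetdb on Debian-family
# example values only; size the heap for your node count and catalog sizes
JAVA_ARGS="-Xmx2g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/puppetdb"

# restart to pick up the change
service puppetdb restart

A dump written that way can then be opened in something like Eclipse MAT
or jhat to see what was actually holding the memory.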

Heap memory naturally fluctuates over time ... and yes, sometimes it can
trend upwards, but usually it should be garbage collecting and going up
and down, so it depends on exactly what you are monitoring and where you
got the number, I guess.
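If you want to watch the actual GC behaviour rather than a single
dashboard gauge, jstat against the PuppetDB java process is handy
(assuming a full JDK is installed; the pgrep pattern is just a guess at
how the process shows up on your box):

# print heap and GC utilisation percentages every 5 seconds
jstat -gcutil $(pgrep -f puppetdb) 5000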

> I have no clue in how to find out what exactly is wrong. When I was running
> PuppetDB 1.6 I could do with 256MB heap space.
> Does anyone have an idea how to investigate what's wrong?

Well, it shouldn't have gotten any worse in the later versions afaik.
What version are you running now?

So here is the common memory-bloat story as I know it.

During ingestion of new content, PuppetDB has to hold the new
catalogs/facts/reports in memory for a short period of time as it decodes
the JSON and stores it on the internal queue ... at that point there are
really two copies floating around: one is the JSON, the other is the
internal Clojure data structure. After that, command listener threads
process these 'commands' and store them in the database. Sometimes a very
large catalog can cause memory bloat, and if you happen to receive more
than one at a time it can be much worse. Likewise, a very large factset
or report can also cause issues, for example if you are storing a lot of
information in facts.

But more often than not, it's to do with a large catalog, and while most
catalogs are sane there are a few patterns that bloat them. A
combinatorial edge problem is a common cause, so doing something like:

File<||>->Package<||>

(where we are trying to create a relationship between every file and
every package resource, as an example)

can cause a many-to-many set of edges to be created, bloating the
catalog size, because every file/package pair ends up as its own edge in
the catalog (500 files and 200 packages would mean 100,000 edges, for
example). So trying to locate a large catalog might be useful; not to
mention, such a catalog would also cause slower compilation times on the
master :-). Be mindful that we can receive any catalog at any time
without throttling (while we have free threads), so if a few arrive at
once it can cause memory bloat. We also have N backend command processors
listening to the internal queue, and if all N are processing large
catalogs at the same time, that can cause bloat and potentially a crash
as well. N is usually half the number of CPU cores/hyperthreads you have
available.
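One quick and dirty way to spot an oversized catalog is to look at the
size of the catalog JSON each agent caches locally; the path below is
the usual one for open source Puppet 3.x vardirs, so adjust it for your
layout, and nproc shows roughly what N works out to on the PuppetDB host:

# on an agent: the cached catalog; a big file means a big catalog
ls -lh /var/lib/puppet/client_data/catalog/*.json

# on the PuppetDB host: N command processors is roughly half of this
nproc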

Ordinarily the approach is:

* Find out if you have large catalogs, and see if there is something
going on in the manifest to explain it
* Increase your RAM until it stabilises with your load and expected
concurrent catalog sizes

We hope to fix this in the future. For HTTP reception we could reduce
the risk by streaming inbound catalogs straight onto our internal queue,
instead of having to hold the entire thing in RAM and decode it, but that
may require us to switch out our internal ActiveMQ since it doesn't
support streaming afaik. Not to mention, it makes it hard to validate
inbound commands :-). On the query side we already stream results; it's
just that during the POST to PuppetDB we have little ability to stream
today. Another option might be a smaller wire format like msgpack: even
though we support gzipped commands via HTTP, they get unpacked in memory,
whereas a msgpack/CBOR payload would consume less RAM while being
processed. That only reduces the problem, however; with a big enough
catalog even a more compact serialization will still cause issues (and we
still decode it in RAM today, so that copy consumes the same amount
either way).

Another solution we have been looking into is to fail early when we
receive an oversized catalog, rejecting it rather than letting it take
down the whole instance. Alternatively we could break the catalog up into
multiple pieces, but that requires a more complicated process, so it is
less viable.

ken.
