> in the meantime I've added RAM and extended the heap to 2GB. But still I'm
> getting crashes of PuppetDB.
> Last time it was the kernel OOM that killed the java process as I saw in
> /var/log/messages
> kernel: Out of memory: Kill process 10146 (java) score 158 or sacrifice child
> kernel: Killed process 10146, UID 498, (java) total-vm:3649108kB,
> anon-rss:1821580kB, file-rss:72kB
This kind of crash is usually more to do with the tuning on your Linux instance. The OOM killer formula is somewhat tricky, but as a general rule it takes into account the amount of RAM + swap (which most people don't expect). So if your swap is zero, or very low in relation to your memory, you may find the OOM killer is killing processes before your RAM fills up.

The thing you want to research is overcommit_ratio, or your swap allocation. There are lots of articles online about this. As a general rule, if you're running with low swap you need the overcommit_ratio to be higher; by default it's set to 50% of the total virtual memory ordinarily I think, so if a process tries to allocate memory and you've already exceeded 50% of your overall RAM + swap, the OOM killer will kick in.

Here's an example from one of my instances, so you can see how to analyze this:

root@puppetdb1:~# free
             total       used       free     shared    buffers     cached
Mem:       2054120     975464    1078656          0     169772     219876
-/+ buffers/cache:     585816    1468304
Swap:       892924          0     892924
root@puppetdb1:~# cat /proc/sys/vm/overcommit_ratio
50
root@puppetdb1:~#

So in my case the total virtual memory available is 2.8 GB (2 GB RAM + 800 MB swap), and if a process tries to allocate more than 50% of it (1.4 GB), the OOM killer might kick in. I'm obviously trivialising the whole story for brevity (the OOM killer has a few little quirks that might affect all of this, and quite a few independent tunables), but more often than not, if you think you have enough RAM and the OOM killer is still killing your processes, the answer is somewhere between your swap allocation and your overcommit_ratio. I've seen this a bit in virtualised environments, and in places that try to launch instances with zero swap, as an example.

> Does this mean I need to add additional heap space and/or RAM?
> I'm looking at the dashboard and it's as if the heap slowly increases. Right
> after startup it's 3 of 400 MB, after a day or 1 I'm over 1 GB ....

If it's not crashing in the JVM any more, I'd focus on the tuning issue above. If the Java instance is still crashing, you can try increasing the heap to see where the stable point is. If it starts to get stupid and you still don't know why, it's possible to analyze the heap dumps that get left behind for any clues (there's a rough example of how to make sure a dump gets written, and how to poke at it, just below), but about 90% of the time I find it's a large catalog causing it, so I'd focus there first. Heap memory can naturally fluctuate over time ... and yes, sometimes it can increase, but usually it should be garbage collecting and going up and down, so it depends on what you are monitoring exactly, and where you got the number, I guess.
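To make the heap-dump route a bit more concrete, here's a rough sketch. The file locations are assumptions that depend on your distro and packaging (typically /etc/sysconfig/puppetdb on RedHat-style systems or /etc/default/puppetdb on Debian-style ones), and the heap size is only an example:

# In your PuppetDB defaults file, ask the JVM to write a heap dump when it hits an OOM
# (the -Xmx value and the dump directory are examples, adjust for your own box)
JAVA_ARGS="-Xmx2g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/puppetdb"

# Afterwards the stock JDK tools can give you a first look at what was filling the heap:
root@puppetdb1:~# jmap -histo <puppetdb pid> | head -20          # object histogram of the live JVM
root@puppetdb1:~# jhat /var/log/puppetdb/java_pid12345.hprof     # 12345 is a placeholder pid; Eclipse MAT also opens .hprof files

Even a quick histogram like that is often enough to tell whether you're looking at a handful of huge catalogs/factsets or a genuine leak.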
> I have no clue in how to find out what exactly is wrong. When I was running
> PuppetDB 1.6 I could do with 256MB heap space.
> Does anyone have an idea how to investigate what's wrong?

Well, it shouldn't have gotten any worse in the later versions afaik; what version are you running now?

So here is the common memory-bloat story as I know it. During ingestion of new content, PuppetDB has to hold the new catalogs/facts/reports in memory for a short period of time as it decodes the JSON and stores it in the internal queue ... at that point there are really two copies running around: one is the JSON, the other is the internal Clojure data structure. After that there are command listener threads that process these 'commands', storing them in the database. Sometimes a very large catalog can cause a problem with memory bloat, and if you happen to receive more than one at a time, it can be much worse.

At the same time, a very large factset or report can also cause issues, if you are storing a lot of information in facts for example. But more often than not it's to do with a large catalog, and while most catalogs are sane, there are a few cases that bloat them. A combinatorial edge problem can often cause this, so doing something like:

File<||> -> Package<||>

(whereby we are trying to create a relationship between all file & package resources, as an example) can cause a many-to-many set of edges to be created, thus bloating the catalog size, because every one of those edges is reflected in the catalog. So trying to locate a large catalog might be useful (see the PS below for one quick way to do that). Not to mention, such a catalog would cause slower compilation times on the master also :-).

Be mindful that we can receive any catalog at any time without throttling (while we have free threads), so if we get a few at a time, it could cause memory bloat. We also have N backend command processes ordinarily listening to the internal queue, and if all N are processing large catalogs, that can cause bloat and a potential crash also. The formula for N is usually half the number of CPU cores/hyperthreads you have available.

Ordinarily the approach is:

* Find out if you have large catalogs, and see if there is something going on in the manifest to explain it
* Increase your RAM until it stabilises with your load and expected concurrent catalog sizes

We hope to fix this in the future. For the HTTP reception we could reduce this potential by streaming inbound catalogs onto our internal queue, instead of having to hold the entire thing in RAM and decode it, but that may require us to switch out our internal ActiveMQ, since it doesn't support streaming afaik. Not to mention, it makes it hard to validate inbound commands :-). On the query side we already stream results; it's just that during the POST to PuppetDB we have little ability to stream today.

Another solution might be to use a smaller format like msgpack. Even though we support gzipped commands via HTTP, they get unpacked in memory, whereas a msgpack/CBOR payload would consume less RAM while processing. This only reduces the problem, however: with a big enough catalog even a more compact serialization will still cause issues (not to mention we still decode it in RAM today, so that copy still consumes the same RAM).

Another solution we have been looking into is to fail early if we receive a large catalog, and so save the whole instance. Alternatively we could break the catalog up into multiple pieces, but that requires a more complicated process so is less viable.

ken.
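PS: if you do want to hunt for large catalogs, one rough and ready way is to compile a catalog by hand on the master and look at its size. This is just a sketch: it assumes a Puppet 3.x era master that already has recent facts cached for the node, the exact invocation varies between Puppet versions, and somenode.example.com is only a placeholder:

root@puppetmaster:~# puppet master --compile somenode.example.com > /tmp/somenode-catalog.json
root@puppetmaster:~# wc -c /tmp/somenode-catalog.json

Doing that for a handful of representative nodes usually makes any outlier (and the chunk of manifest responsible for it) fairly obvious.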