BTW, we tried the following Confluent-recommended settings and one broker crashed after 30 minutes with an out-of-memory error:
-Xms6g -Xmx6g -XX:MetaspaceSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80

On Sun, Jul 9, 2017 at 8:13 AM, John Yost <hokiege...@gmail.com> wrote:

> Hey Everyone,
>
> When we originally upgraded from 0.9.0.1 to 0.10.0 with the exact same
> settings we immediately observed OOM errors. I upped the heap size from 6
> GB to 10 GB and that solved the OOM issue. However, I am now seeing that
> the ISR count for all partitions goes from 3 to 1 after about an hour
> following broker start.
>
> Monitoring with jstat it appears that, after about an hour, the young
> generation partition stays at or near 100%, at which point the ISR count
> for each partition goes from 3 to 1 and remains there. There appears to be
> a correlation of high GC activity and replica fetch lag.
>
> I am thinking that GC pauses are the issue, which is a result of
> increasing the memory heap size. But, without increasing the memory heap
> size, we get OOM errors.
>
> Any ideas? There must be a setting somewhere that is causing the memory
> heap to fill up in 0.10.0 that did not affect 0.9.0.1.
>
> Thanks
>
> --John