Total shot in the dark, but could this be related? It talks about CPU, but it could have an impact on memory as well: http://kafka.apache.org/0102/documentation.html#upgrade_10_performance_impact
Hope this helps.

On Sun, 9 Jul 2017 at 10:45 John Yost <hokiege...@gmail.com> wrote:

> Hey Ismael,
>
> Thanks a bunch for responding so quickly--really appreciate the follow-up!
> I will have to get those details tomorrow when I return to the office.
>
> Thanks again, will forward details ASAP tomorrow.
>
> --John
>
> On Sun, Jul 9, 2017 at 10:41 AM, Ismael Juma <ism...@juma.me.uk> wrote:
>
> > Hi John,
> >
> > We would need more details to be able to help. What is the version of your
> > producers and consumers, is compression being used (and the compression
> > type if it is) and what is the broker/topic message format version?
> >
> > Ismael
> >
> > On Sun, Jul 9, 2017 at 1:13 PM, John Yost <hokiege...@gmail.com> wrote:
> >
> > > Hey Everyone,
> > >
> > > When we originally upgraded from 0.9.0.1 to 0.10.0 with the exact same
> > > settings we immediately observed OOM errors. I upped the heap size from 6
> > > GB to 10 GB and that solved the OOM issue. However, I am now seeing that
> > > the ISR count for all partitions goes from 3 to 1 after about an hour
> > > following broker start.
> > >
> > > Monitoring with jstat it appears that, after about an hour, the young
> > > generation partition stays at or near 100%, at which point the ISR count
> > > for each partition goes from 3 to 1 and remains there. There appears to be
> > > a correlation of high GC activity and replica fetch lag.
> > >
> > > I am thinking that GC pauses are the issue, which is a result of increasing
> > > the memory heap size. But, without increasing the memory heap size, we get
> > > OOM errors.
> > >
> > > Any ideas? There must be a setting somewhere that is causing the memory
> > > heap to fill up in 0.10.0 that did not affect 0.9.0.1.
> > >
> > > Thanks
> > >
> > > --John
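
On the jstat observation in the quoted message: if it helps to line up the GC behaviour with the time the ISR shrinks, here is a rough sketch that scans `jstat -gcutil <pid> <interval>` output and flags when Eden (the `E` column) stays near 100% for several samples in a row. The function names, the 95% threshold, the three-sample window, and the sample rows are all my own assumptions for illustration, not real broker data or anything from the Kafka tooling.

```python
def parse_gcutil(text):
    """Parse `jstat -gcutil` output into a list of dicts keyed by column name.

    The first non-empty line is treated as the header, so this does not
    depend on a particular JVM version's column set. A "-" cell becomes None.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [
        {col: (None if val == "-" else float(val)) for col, val in zip(header, row)}
        for row in rows
    ]


def eden_saturated(samples, threshold=95.0, min_consecutive=3):
    """Return True if Eden utilisation (column E) is at or above `threshold`
    for at least `min_consecutive` samples in a row.

    A single full reading is normal right before a young collection, so only
    a sustained run is reported. Both defaults are arbitrary assumptions.
    """
    run = 0
    for sample in samples:
        eden = sample.get("E")
        run = run + 1 if eden is not None and eden >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Made-up sample rows in the Java 8 `-gcutil` layout, for illustration only.
SAMPLE = """\
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  45.10  62.30  40.12  97.01  94.50    120    3.210     2    0.410    3.620
  0.00  45.10  97.80  40.12  97.01  94.50    120    3.210     2    0.410    3.620
  0.00  45.10  99.10  40.12  97.01  94.50    120    3.210     2    0.410    3.620
  0.00  45.10 100.00  40.12  97.01  94.50    120    3.210     2    0.410    3.620
"""

if __name__ == "__main__":
    samples = parse_gcutil(SAMPLE)
    print("Eden saturated:", eden_saturated(samples))
```

Feeding it timestamped samples from the broker and comparing against when the ISR drops from 3 to 1 would at least confirm or rule out the correlation before changing any heap or GC settings.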