Hey everyone,

When we originally upgraded from 0.9.0.1 to 0.10.0 with the exact same settings, we immediately saw OOM errors. I upped the heap size from 6 GB to 10 GB, and that solved the OOM issue. However, I am now seeing the ISR count for all partitions drop from 3 to 1 about an hour after broker start.
Monitoring with jstat, it appears that after about an hour the young generation stays at or near 100% utilization, at which point the ISR count for each partition goes from 3 to 1 and remains there. There appears to be a correlation between high GC activity and replica fetch lag. I suspect GC pauses are the issue, and that they are a result of the larger heap. But without the larger heap, we get OOM errors.

Any ideas? There must be a setting somewhere that causes the heap to fill up in 0.10.0 but did not affect 0.9.0.1.

Thanks
--John