Hey Matt,

Indeed! Ismael mentioned this same link yesterday; I tried it this AM, and
this change totally fixed the problem! The manifestation we observed was
not increased CPU usage, but rather a MUCH larger memory heap requirement.
Once I changed log.message.format.version to the version of our clients,
the following occurred:

1. ISRs went to full replication for each partition
2. Memory heap usage went down by a factor of 6
3. Storm throughput went up by a factor of 5
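
For anyone who finds this thread later, the change amounts to one line in
each broker's server.properties. This is a minimal sketch, not our exact
config: the value 0.9.0 is an assumption based on clients that are still on
0.9.0.x, so set it to whatever version your own clients run.

```properties
# server.properties on each broker
# Assumed value: matches 0.9.0.x producers/consumers; adjust to your clients.
log.message.format.version=0.9.0
```

As the linked upgrade notes describe, when the on-disk message format is
newer than what the consumers understand, the broker has to convert messages
for them, which is the extra work this setting avoids. Each broker needs a
restart to pick up the change.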

Our cluster looks great now--thanks to you and Ismael for pointing me to
the docs where the config issue was described--much, much appreciated!

--John

On Mon, Jul 10, 2017 at 11:52 AM, Matt Andruff <matt.andr...@gmail.com>
wrote:

> Total shot in the dark, but could it be related to this? It talks about
> CPU but could have an impact on memory as well:
> http://kafka.apache.org/0102/documentation.html#upgrade_10_performance_impact
>
> Hope this helps.
>
>
> On Sun, 9 Jul 2017 at 10:45 John Yost <hokiege...@gmail.com> wrote:
>
> > Hey Ismael,
> >
> > Thanks a bunch for responding so quickly--really appreciate the follow-up!
> > I will have to get those details tomorrow when I return to the office.
> >
> > Thanks again, will forward details ASAP tomorrow.
> >
> > --John
> >
> > On Sun, Jul 9, 2017 at 10:41 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> >
> > > Hi John,
> > >
> > > We would need more details to be able to help. What is the version of
> > > your producers and consumers, is compression being used (and if so,
> > > which compression type), and what is the broker/topic message format
> > > version?
> > >
> > > Ismael
> > >
> > > On Sun, Jul 9, 2017 at 1:13 PM, John Yost <hokiege...@gmail.com> wrote:
> > >
> > > > Hey Everyone,
> > > >
> > > > When we originally upgraded from 0.9.0.1 to 0.10.0 with the exact same
> > > > settings, we immediately observed OOM errors. I upped the heap size
> > > > from 6 GB to 10 GB and that solved the OOM issue. However, I am now
> > > > seeing that the ISR count for all partitions goes from 3 to 1 about
> > > > an hour after broker start.
> > > >
> > > > Monitoring with jstat, it appears that after about an hour the young
> > > > generation space stays at or near 100%, at which point the ISR count
> > > > for each partition goes from 3 to 1 and remains there. There appears
> > > > to be a correlation between high GC activity and replica fetch lag.
> > > >
> > > > I am thinking that GC pauses are the issue, and that they are a
> > > > result of increasing the heap size. But without increasing the heap
> > > > size, we get OOM errors.
> > > >
> > > > Any ideas? There must be a setting somewhere that is causing the
> > > > heap to fill up in 0.10.0 that did not affect 0.9.0.1.
> > > >
> > > > Thanks
> > > >
> > > > --John
> > > >
> > >
> >
>
