We've been running with the new patch since yesterday, and everything seems
to be working just fine (comparable to the previous patch).

I haven't looked into the memory consumption that much though; it doesn't
look like it varies much between the two patches.
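
(For the memory comparison: a quick way to sample broker heap usage,
assuming a standard JDK on the hosts and that the broker runs under the
kafka.Kafka main class, is something like

  jstat -gcutil $(pgrep -f kafka.Kafka) 5000

which prints heap occupancy and GC activity every five seconds.)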

On Mon Feb 16 2015 at 7:31:34 PM Solon Gordon <so...@knewton.com> wrote:

> I tested the new patch out and am seeing comparable CPU usage to the
> previous patch. As far as I can see, heap usage is also comparable between
> the two patches, though I will say that both look significantly better than
> 0.8.1.1 (~250MB vs. ~1GB).
>
> I'll report back if any new issues come up as I start adding more
> producer/consumer load.
>
> On Sun, Feb 15, 2015 at 6:38 PM, Jun Rao <j...@confluent.io> wrote:
>
> > Solon, Mathias,
> >
> > Thanks for testing this out. I just uploaded a slightly modified patch in
> > https://issues.apache.org/jira/browse/KAFKA-1952. The new patch may not
> > improve the latency and CPU usage further, but will potentially improve
> > memory consumption. It would be great if you guys can test the new patch
> > out.
> >
> > Thanks,
> >
> > Jun
> >
> > On Sat, Feb 14, 2015 at 9:08 AM, Mathias Söderberg <
> > mathias.soederb...@gmail.com> wrote:
> >
> > > Jun,
> > >
> > > I updated our brokers earlier today with the mentioned patch. A week ago
> > > our brokers used ~380% CPU (out of 400%) quite consistently, and now
> > > they're varying between 250-325% (probably running a bit high right now
> > > as we have some consumers catching up quite some lag), so there's
> > > definitely an improvement. The producer latency is still a bit higher
> > > than with 0.8.1.1, but I've been playing a bit with broker configuration
> > > as well as producer configuration lately so that probably plays in a bit.
> > >
> > > I'll keep an eye on our metrics, and am going to mess around a bit with
> > > configuration. Right now our traffic load is quite low, so it'd be
> > > interesting to see how this works over the next few days. With that said,
> > > we're at the same levels of CPU usage as with 0.8.1.1 (though with an
> > > additional broker), so everything looks pretty great.
> > >
> > > We're using acks = "all" (-1) by the way.
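> > >
> > > (That is, the property usually spelled one of these ways, depending on
> > > the producer client -- listed here just for clarity:
> > >
> > >   request.required.acks=-1   # old (Scala) producer
> > >   acks=all                   # new Java producer in 0.8.2
> > > )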
> > >
> > > Best regards,
> > > Mathias
> > >
> > > On Sat Feb 14 2015 at 4:40:31 AM Solon Gordon <so...@knewton.com> wrote:
> > >
> > > > Thanks for the fast response. I did a quick test and initial results
> > > > look promising. When I swapped in the patched version, CPU usage
> > > > dropped from ~150% to ~65%. Still a bit higher than what I see with
> > > > 0.8.1.1 but much more reasonable.
> > > >
> > > > I'll do more testing on Monday but wanted to get you some quick
> > > > feedback. Hopefully Mathias will have good results as well.
> > > >
> > > > On Fri, Feb 13, 2015 at 9:14 PM, Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Mathias, Solon,
> > > > >
> > > > > We did identify a CPU issue and patched it in
> > > > > https://issues.apache.org/jira/browse/KAFKA-1952. Could you apply the
> > > > > patch in the 0.8.2 branch and see if that addresses the issue?
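> > > > >
> > > > > (Roughly along these lines, assuming a git checkout of the Kafka
> > > > > source -- the patch file name is just a placeholder for whatever is
> > > > > attached to the JIRA, and use patch -p1 instead of git apply if that
> > > > > matches how the patch was generated:
> > > > >
> > > > >   git checkout 0.8.2
> > > > >   git apply KAFKA-1952.patch
> > > > >   ./gradlew jar
> > > > > )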
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Fri, Feb 13, 2015 at 3:26 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> > > > >
> > > > > > We can reproduce this issue, have a theory as to the cause, and are
> > > > > > working on a fix. Here is the ticket to track it:
> > > > > > https://issues.apache.org/jira/browse/KAFKA-1952
> > > > > >
> > > > > > I would recommend people hold off on 0.8.2 upgrades until we have a
> > > > > > handle on this.
> > > > > >
> > > > > > -Jay
> > > > > >
> > > > > > On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon <so...@knewton.com> wrote:
> > > > > >
> > > > > > > The partitions nearly all have replication factor 2 (a few stray
> > > > > > > ones have 1), and our producers use request.required.acks=-1.
> > > > > > > However, I should note there were hardly any messages being
> > > > > > > produced when I did the upgrade and observed the high CPU load.
> > > > > > >
> > > > > > > I should have time to do some profiling on Monday and will get
> > > > > > > back to you with the results.
> > > > > > >
> > > > > > > On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <j...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Solon,
> > > > > > > >
> > > > > > > > What's the replication factor you used for those partitions?
> > > > > > > > What's the producer ack that you used? Also, could you do a bit
> > > > > > > > of profiling on the broker to see which methods used the most
> > > > > > > > CPU?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com> wrote:
> > > > > > > >
> > > > > > > > > I saw a very similar jump in CPU usage when I tried upgrading
> > > > > > > > > from 0.8.1.1 to 0.8.2.0 today in a test environment. The Kafka
> > > > > > > > > cluster there is two m1.larges handling 2,000 partitions
> > > > > > > > > across 32 topics. CPU usage rose from 40% into the 150%–190%
> > > > > > > > > range, and load average from under 1 to over 4. Downgrading to
> > > > > > > > > 0.8.1.1 brought the CPU and load back to the previous values.
> > > > > > > > >
> > > > > > > > > If there's more info that would be helpful, please let me know.
> > > > > > > > >
> > > > > > > > > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > > > > > > > > mathias.soederb...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Jun,
> > > > > > > > > >
> > > > > > > > > > Pardon the radio silence. I booted up a new broker, created
> > > > > > > > > > a topic with three (3) partitions and replication factor one
> > > > > > > > > > (1) and used the *kafka-producer-perf-test.sh* script to
> > > > > > > > > > generate load (using messages of roughly the same size as
> > > > > > > > > > ours). There was a slight increase in CPU usage (~5-10%) on
> > > > > > > > > > 0.8.2.0-rc2 compared to 0.8.1.1, but that was about it.
> > > > > > > > > >
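> > > > > > > > > > (Something along the lines of the following would set up
> > > > > > > > > > that kind of test -- the flags and numbers here are
> > > > > > > > > > illustrative, not the exact invocation that was used:
> > > > > > > > > >
> > > > > > > > > >   bin/kafka-topics.sh --create --zookeeper localhost:2181 \
> > > > > > > > > >     --topic perf-test --partitions 3 --replication-factor 1
> > > > > > > > > >   bin/kafka-producer-perf-test.sh --broker-list localhost:9092 \
> > > > > > > > > >     --topics perf-test --messages 1000000 --message-size 1000
> > > > > > > > > > )
> > > > > > > > > >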
> > > > > > > > > > I upgraded our staging cluster to 0.8.2.0 earlier this week
> > > > > > > > > > or so, and had to add an additional broker due to increased
> > > > > > > > > > load after the upgrade (note that the incoming load on the
> > > > > > > > > > cluster has been pretty much consistent). Since the upgrade
> > > > > > > > > > we've been seeing a 2-3x increase in latency as well. I'm
> > > > > > > > > > considering downgrading to 0.8.1.1 again to see if it
> > > > > > > > > > resolves our issues.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Mathias
> > > > > > > > > >
> > > > > > > > > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <j...@confluent.io> wrote:
> > > > > > > > > >
> > > > > > > > > > > Mathias,
> > > > > > > > > > >
> > > > > > > > > > > The new hprof doesn't reveal anything new to me. We did fix
> > > > > > > > > > > the logic in using Purgatory in 0.8.2, which could
> > > > > > > > > > > potentially drive up the CPU usage a bit. To verify that,
> > > > > > > > > > > could you do your test on a single broker (with replication
> > > > > > > > > > > factor 1) between 0.8.1 and 0.8.2 and see if there is any
> > > > > > > > > > > significant difference in CPU usage?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Jun
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > > > > > > > > mathias.soederb...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Jun,
> > > > > > > > > > > >
> > > > > > > > > > > > I re-ran the hprof test, for about 30 minutes again, for
> > > > > > > > > > > > 0.8.2.0-rc2 with the same version of snappy that 0.8.1.1
> > > > > > > > > > > > used. Attached the logs. Unfortunately there wasn't any
> > > > > > > > > > > > improvement as the node running 0.8.2.0-rc2 still had a
> > > > > > > > > > > > higher load and CPU usage.
> > > > > > > > > > > >
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Mathias
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <jai.forums2...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > > > > > > > > >> > Jaikiran,
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > The fix you provided is probably unnecessary. The
> > > > > > > > > > > >> > channel that we use in SimpleConsumer
> > > > > > > > > > > >> > (BlockingChannel) is configured to be blocking. So
> > > > > > > > > > > >> > even though the read from the socket is in a loop,
> > > > > > > > > > > >> > each read blocks if there are no bytes received from
> > > > > > > > > > > >> > the broker. So, that shouldn't cause extra CPU
> > > > > > > > > > > >> > consumption.
> > > > > > > > > > > >> Hi Jun,
> > > > > > > > > > > >>
> > > > > > > > > > > >> Of course, you are right! I forgot that while reading
> > > > > > > > > > > >> the thread dump in hprof output, one has to be aware
> > > > > > > > > > > >> that the thread state isn't shown and the thread need
> > > > > > > > > > > >> not necessarily be doing any CPU activity.
> > > > > > > > > > > >>
> > > > > > > > > > > >> -Jaikiran
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Thanks,
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Jun
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <mathias.soederb...@gmail.com> wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> Hi Neha,
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> I sent an e-mail earlier today, but noticed now that
> > > > > > > > > > > >> >> it didn't actually go through.
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> Anyhow, I've attached two files, one with output from
> > > > > > > > > > > >> >> a 10 minute run and one with output from a 30 minute
> > > > > > > > > > > >> >> run. Realized that maybe I should've done one or two
> > > > > > > > > > > >> >> runs with 0.8.1.1 as well, but nevertheless.
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and
> > > > > > > > > > > >> >> I'm seeing the same CPU usage as with the beta
> > > > > > > > > > > >> >> version (basically pegging all cores). If I manage to
> > > > > > > > > > > >> >> find the time I'll do another run with hprof on the
> > > > > > > > > > > >> >> rc2 version later today.
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> Best regards,
> > > > > > > > > > > >> >> Mathias
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <n...@confluent.io> wrote:
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >>> The following should be sufficient
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>> java
> > > > > > > > > > > >> >>> -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof
> > > > > > > > > > > >> >>> <classname>
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>> You would need to start the Kafka server with the
> > > > > > > > > > > >> >>> settings above for some time until you observe the
> > > > > > > > > > > >> >>> problem.
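> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>> (If you start the broker with the bundled scripts,
> > > > > > > > > > > >> >>> one way to pass these options in -- sketched here,
> > > > > > > > > > > >> >>> not tested -- is via KAFKA_OPTS, e.g.:
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>>   export KAFKA_OPTS="-agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof"
> > > > > > > > > > > >> >>>   bin/kafka-server-start.sh config/server.properties
> > > > > > > > > > > >> >>> )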
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <mathias.soederb...@gmail.com> wrote:
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>>> Hi Neha,
> > > > > > > > > > > >> >>>>
> > > > > > > > > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any
> > > > > > > > > > > >> >>>> particular options I should include or just run
> > > > > > > > > > > >> >>>> with defaults?
> > > > > > > > > > > >> >>>>
> > > > > > > > > > > >> >>>> Best regards,
> > > > > > > > > > > >> >>>> Mathias
> > > > > > > > > > > >> >>>>
> > > > > > > > > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <n...@confluent.io> wrote:
> > > > > > > > > > > >> >>>>> Thanks for reporting the issue. Would you mind
> > > > > > > > > > > >> >>>>> running hprof and sending the output?
> > > > > > > > > > > >> >>>>>
> > > > > > > > > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <mathias.soederb...@gmail.com> wrote:
> > > > > > > > > > > >> >>>>>
> > > > > > > > > > > >> >>>>>> Good day,
> > > > > > > > > > > >> >>>>>>
> > > > > > > > > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to
> > > > > > > > > > > >> >>>>>> v0.8.2-beta and noticed that the CPU usage on the
> > > > > > > > > > > >> >>>>>> broker machines went up by roughly 40%, from ~60%
> > > > > > > > > > > >> >>>>>> to ~100%, and am wondering if anyone else has
> > > > > > > > > > > >> >>>>>> experienced something similar? The load average
> > > > > > > > > > > >> >>>>>> also went up by 2x-3x.
> > > > > > > > > > > >> >>>>>>
> > > > > > > > > > > >> >>>>>> We're running on EC2 and the cluster currently
> > > > > > > > > > > >> >>>>>> consists of four m1.xlarge, with roughly 1100
> > > > > > > > > > > >> >>>>>> topics / 4000 partitions. Using Java 7 (1.7.0_65
> > > > > > > > > > > >> >>>>>> to be exact) and Scala 2.9.2. Configurations can
> > > > > > > > > > > >> >>>>>> be found over here:
> > > > > > > > > > > >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > > > > > > > > >> >>>>>>
> > > > > > > > > > > >> >>>>>> I'm assuming that this is not expected behaviour
> > > > > > > > > > > >> >>>>>> for 0.8.2-beta?
> > > > > > > > > > > >> >>>>>>
> > > > > > > > > > > >> >>>>>> Best regards,
> > > > > > > > > > > >> >>>>>> Mathias
> > > > > > > > > > > >> >>>>>>
> > > > > > > > > > > >> >>>>>
> > > > > > > > > > > >> >>>>>
> > > > > > > > > > > >> >>>>> --
> > > > > > > > > > > >> >>>>> Thanks,
> > > > > > > > > > > >> >>>>> Neha
> > > > > > > > > > > >> >>>>>
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >> >>> --
> > > > > > > > > > > >> >>> Thanks,
> > > > > > > > > > > >> >>> Neha
> > > > > > > > > > > >> >>>
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
