Thanks for the fast response. I did a quick test and initial results look
promising. When I swapped in the patched version, CPU usage dropped from
~150% to ~65%. Still a bit higher than what I see with 0.8.1.1 but much
more reasonable.

I'll do more testing on Monday but wanted to get you some quick feedback.
Hopefully Mathias will have good results as well.

On Fri, Feb 13, 2015 at 9:14 PM, Jun Rao <j...@confluent.io> wrote:

> Mathias, Solon,
>
> We did identify a CPU issue and patched it in
> https://issues.apache.org/jira/browse/KAFKA-1952. Could you apply the patch
> in the 0.8.2 branch and see if that addresses the issue?
>
> Thanks,
>
> Jun
>
> On Fri, Feb 13, 2015 at 3:26 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> > We can reproduce this issue, have a theory as to the cause, and are working
> > on a fix. Here is the ticket to track it:
> > https://issues.apache.org/jira/browse/KAFKA-1952
> >
> > I would recommend people hold off on 0.8.2 upgrades until we have a handle
> > on this.
> >
> > -Jay
> >
> > On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon <so...@knewton.com> wrote:
> >
> > > The partitions nearly all have replication factor 2 (a few stray ones have
> > > 1), and our producers use request.required.acks=-1. However, I should note
> > > there were hardly any messages being produced when I did the upgrade and
> > > observed the high CPU load.
> > >
> > > I should have time to do some profiling on Monday and will get back to you
> > > with the results.
> > >
> > > On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Solon,
> > > >
> > > > What's the replication factor you used for those partitions? What's the
> > > > producer ack that you used? Also, could you do a bit of profiling on the
> > > > broker to see which methods used the most CPU?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com> wrote:
> > > >
> > > > > I saw a very similar jump in CPU usage when I tried upgrading from 0.8.1.1
> > > > > to 0.8.2.0 today in a test environment. The Kafka cluster there is two
> > > > > m1.larges handling 2,000 partitions across 32 topics. CPU usage rose from
> > > > > 40% into the 150%–190% range, and load average from under 1 to over 4.
> > > > > Downgrading to 0.8.1.1 brought the CPU and load back to the previous
> > > > > values.
> > > > >
> > > > > If there's more info that would be helpful, please let me know.
> > > > >
> > > > > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > > > > mathias.soederb...@gmail.com> wrote:
> > > > >
> > > > > > Jun,
> > > > > >
> > > > > > Pardon the radio silence. I booted up a new broker, created a topic with
> > > > > > three (3) partitions and replication factor one (1) and used the
> > > > > > kafka-producer-perf-test.sh script to generate load (using messages of
> > > > > > roughly the same size as ours). There was a slight increase in CPU usage
> > > > > > (~5-10%) on 0.8.2.0-rc2 compared to 0.8.1.1, but that was about it.
> > > > > >
> > > > > > I upgraded our staging cluster to 0.8.2.0 earlier this week or so, and had
> > > > > > to add an additional broker due to increased load after the upgrade (note
> > > > > > that the incoming load on the cluster has been pretty much consistent).
> > > > > > Since the upgrade we've been seeing a 2-3x increase in latency as well.
> > > > > > I'm considering downgrading to 0.8.1.1 again to see if it resolves our
> > > > > > issues.
> > > > > >
> > > > > > Best regards,
> > > > > > Mathias
> > > > > >
> > > > > > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <j...@confluent.io> wrote:
> > > > > >
> > > > > > > Mathias,
> > > > > > >
> > > > > > > The new hprof doesn't reveal anything new to me. We did fix the logic in
> > > > > > > using Purgatory in 0.8.2, which could potentially drive up the CPU usage a
> > > > > > > bit. To verify that, could you do your test on a single broker (with
> > > > > > > replication factor 1) between 0.8.1 and 0.8.2 and see if there is any
> > > > > > > significant difference in CPU usage?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > > > > mathias.soederb...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Jun,
> > > > > > > >
> > > > > > > > I re-ran the hprof test, for about 30 minutes again, for 0.8.2.0-rc2 with
> > > > > > > > the same version of snappy that 0.8.1.1 used. Attached the logs.
> > > > > > > > Unfortunately there wasn't any improvement as the node running 0.8.2.0-rc2
> > > > > > > > still had a higher load and CPU usage.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Mathias
> > > > > > > >
> > > > > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <jai.forums2...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > > > > >> > Jaikiran,
> > > > > > > >> >
> > > > > > > > >> > The fix you provided is probably unnecessary. The channel that we use in
> > > > > > > > >> > SimpleConsumer (BlockingChannel) is configured to be blocking. So even
> > > > > > > > >> > though the read from the socket is in a loop, each read blocks if there
> > > > > > > > >> > are no bytes received from the broker. So, that shouldn't cause extra CPU
> > > > > > > > >> > consumption.
> > > > > > > >> Hi Jun,
> > > > > > > >>
> > > > > > > > >> Of course, you are right! I forgot that while reading the thread dump in
> > > > > > > > >> hprof output, one has to be aware that the thread state isn't shown and
> > > > > > > > >> the thread need not necessarily be doing any CPU activity.
> > > > > > > >>
> > > > > > > >> -Jaikiran
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> >
> > > > > > > >> > Thanks,
> > > > > > > >> >
> > > > > > > >> > Jun
> > > > > > > >> >
> > > > > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > > > > > >> > mathias.soederb...@gmail.com> wrote:
> > > > > > > >> >
> > > > > > > >> >> Hi Neha,
> > > > > > > >> >>
> > > > > > > > >> >> I sent an e-mail earlier today, but noticed now that it didn't actually go
> > > > > > > > >> >> through.
> > > > > > > > >> >>
> > > > > > > > >> >> Anyhow, I've attached two files, one with output from a 10 minute run and
> > > > > > > > >> >> one with output from a 30 minute run. Realized that maybe I should've done
> > > > > > > > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > > > > > > > >> >>
> > > > > > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same CPU
> > > > > > > > >> >> usage as with the beta version (basically pegging all cores). If I manage
> > > > > > > > >> >> to find the time I'll do another run with hprof on the rc2 version later
> > > > > > > > >> >> today.
> > > > > > > >> >>
> > > > > > > >> >> Best regards,
> > > > > > > >> >> Mathias
> > > > > > > >> >>
> > > > > > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <n...@confluent.io> wrote:
> > > > > > > >> >>
> > > > > > > > >> >>> The following should be sufficient
> > > > > > > > >> >>>
> > > > > > > > >> >>> java -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof <classname>
> > > > > > > > >> >>>
> > > > > > > > >> >>> You would need to start the Kafka server with the settings above for
> > > > > > > > >> >>> some time until you observe the problem.
> > > > > > > >> >>>
> > > > > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > > > > > >> >>> mathias.soederb...@gmail.com> wrote:
> > > > > > > >> >>>
> > > > > > > >> >>>> Hi Neha,
> > > > > > > >> >>>>
> > > > > > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any particular options I
> > > > > > > > >> >>>> should include or just run with defaults?
> > > > > > > >> >>>>
> > > > > > > >> >>>> Best regards,
> > > > > > > >> >>>> Mathias
> > > > > > > >> >>>>
> > > > > > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <n...@confluent.io> wrote:
> > > > > > > > >> >>>>> Thanks for reporting the issue. Would you mind running hprof and
> > > > > > > > >> >>>>> sending the output?
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > > > > > >> >>>>> mathias.soederb...@gmail.com> wrote:
> > > > > > > >> >>>>>
> > > > > > > >> >>>>>> Good day,
> > > > > > > >> >>>>>>
> > > > > > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed that
> > > > > > > > >> >>>>>> the CPU usage on the broker machines went up by roughly 40%, from ~60% to
> > > > > > > > >> >>>>>> ~100% and am wondering if anyone else has experienced something similar?
> > > > > > > > >> >>>>>> The load average also went up by 2x-3x.
> > > > > > > > >> >>>>>>
> > > > > > > > >> >>>>>> We're running on EC2 and the cluster currently consists of four m1.xlarge,
> > > > > > > > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65 to be
> > > > > > > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found over here:
> > > > > > > > >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > > > > > >> >>>>>>
> > > > > > > > >> >>>>>> I'm assuming that this is not expected behaviour for 0.8.2-beta?
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> Best regards,
> > > > > > > >> >>>>>> Mathias
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> --
> > > > > > > >> >>>>> Thanks,
> > > > > > > >> >>>>> Neha
> > > > > > > >> >>>>>
> > > > > > > >> >>>
> > > > > > > >> >>>
> > > > > > > >> >>> --
> > > > > > > >> >>> Thanks,
> > > > > > > >> >>> Neha
> > > > > > > >> >>>
> > > > > > > >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
