I tested the new patch out and am seeing comparable CPU usage to the previous patch. As far as I can see, heap usage is also comparable between the two patches, though I will say that both look significantly better than 0.8.1.1 (~250MB vs. ~1GB).
I'll report back if any new issues come up as I start adding more producer/consumer load.

On Sun, Feb 15, 2015 at 6:38 PM, Jun Rao <j...@confluent.io> wrote:

> Solon, Mathias,
>
> Thanks for testing this out. I just uploaded a slightly modified patch in https://issues.apache.org/jira/browse/KAFKA-1952. The new patch may not improve the latency and CPU usage further, but will potentially improve memory consumption. It would be great if you guys can test the new patch out.
>
> Thanks,
>
> Jun
>
> On Sat, Feb 14, 2015 at 9:08 AM, Mathias Söderberg <mathias.soederb...@gmail.com> wrote:
>
> > Jun,
> >
> > I updated our brokers earlier today with the mentioned patch. A week ago our brokers used ~380% CPU (out of 400%) quite consistently, and now they're varying between 250-325% (probably running a bit high right now, as we have some consumers catching up quite some lag), so there's definitely an improvement. The producer latency is still a bit higher than with 0.8.1.1, but I've been playing a bit with broker configuration as well as producer configuration lately, so that probably plays in a bit.
> >
> > I'll keep an eye on our metrics, and am going to mess around a bit with configuration. Right now our traffic load is quite low, so it'd be interesting to see how this works over the next few days. With that said, we're at the same levels of CPU usage as with 0.8.1.1 (though with an additional broker), so everything looks pretty great.
> >
> > We're using acks = "all" (-1), by the way.
> >
> > Best regards,
> > Mathias
> >
> > On Sat Feb 14 2015 at 4:40:31 AM Solon Gordon <so...@knewton.com> wrote:
> >
> > > Thanks for the fast response. I did a quick test and initial results look promising. When I swapped in the patched version, CPU usage dropped from ~150% to ~65%. Still a bit higher than what I see with 0.8.1.1, but much more reasonable.
> > > I'll do more testing on Monday but wanted to get you some quick feedback. Hopefully Mathias will have good results as well.
> > >
> > > On Fri, Feb 13, 2015 at 9:14 PM, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Mathias, Solon,
> > > >
> > > > We did identify a CPU issue and patched it in https://issues.apache.org/jira/browse/KAFKA-1952. Could you apply the patch in the 0.8.2 branch and see if that addresses the issue?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Fri, Feb 13, 2015 at 3:26 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> > > >
> > > > > We can reproduce this issue, have a theory as to the cause, and are working on a fix. Here is the ticket to track it: https://issues.apache.org/jira/browse/KAFKA-1952
> > > > >
> > > > > I would recommend people hold off on 0.8.2 upgrades until we have a handle on this.
> > > > >
> > > > > -Jay
> > > > >
> > > > > On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon <so...@knewton.com> wrote:
> > > > >
> > > > > > The partitions nearly all have replication factor 2 (a few stray ones have 1), and our producers use request.required.acks=-1. However, I should note there were hardly any messages being produced when I did the upgrade and observed the high CPU load.
> > > > > >
> > > > > > I should have time to do some profiling on Monday and will get back to you with the results.
> > > > > >
> > > > > > On Fri, Feb 13, 2015 at 1:00 PM, Jun Rao <j...@confluent.io> wrote:
> > > > > >
> > > > > > > Solon,
> > > > > > >
> > > > > > > What's the replication factor you used for those partitions? What's the producer ack that you used? Also, could you do a bit of profiling on the broker to see which methods used the most CPU?
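[Editor's note: for context on the ack setting discussed above, here is a minimal sketch of a 0.8-era producer configuration with request.required.acks=-1. Only the request.required.acks property name comes from the thread; the metadata.broker.list property and its placeholder hosts are assumptions added for completeness.]

```java
import java.util.Properties;

public class AcksConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list; substitute real hosts (assumption, not from the thread).
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        // -1 (a.k.a. "all"): the leader responds only after all in-sync
        // replicas have acknowledged the write, as the posters above use.
        props.put("request.required.acks", "-1");
        System.out.println(props.getProperty("request.required.acks"));  // prints "-1"
    }
}
```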
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com> wrote:
> > > > > > >
> > > > > > > > I saw a very similar jump in CPU usage when I tried upgrading from 0.8.1.1 to 0.8.2.0 today in a test environment. The Kafka cluster there is two m1.larges handling 2,000 partitions across 32 topics. CPU usage rose from 40% into the 150%–190% range, and load average from under 1 to over 4. Downgrading to 0.8.1.1 brought the CPU and load back to the previous values.
> > > > > > > >
> > > > > > > > If there's more info that would be helpful, please let me know.
> > > > > > > >
> > > > > > > > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <mathias.soederb...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Jun,
> > > > > > > > >
> > > > > > > > > Pardon the radio silence. I booted up a new broker, created a topic with three (3) partitions and replication factor one (1), and used the kafka-producer-perf-test.sh script to generate load (using messages of roughly the same size as ours). There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2 compared to 0.8.1.1, but that was about it.
> > > > > > > > >
> > > > > > > > > I upgraded our staging cluster to 0.8.2.0 earlier this week or so, and had to add an additional broker due to increased load after the upgrade (note that the incoming load on the cluster has been pretty much consistent).
> > > > > > > > > Since the upgrade we've been seeing a 2-3x increase in latency as well. I'm considering downgrading to 0.8.1.1 again to see if it resolves our issues.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Mathias
> > > > > > > > >
> > > > > > > > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <j...@confluent.io> wrote:
> > > > > > > > >
> > > > > > > > > > Mathias,
> > > > > > > > > >
> > > > > > > > > > The new hprof doesn't reveal anything new to me. We did fix the logic in using Purgatory in 0.8.2, which could potentially drive up the CPU usage a bit. To verify that, could you do your test on a single broker (with replication factor 1) between 0.8.1 and 0.8.2 and see if there is any significant difference in CPU usage?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <mathias.soederb...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Jun,
> > > > > > > > > > >
> > > > > > > > > > > I re-ran the hprof test, for about 30 minutes again, for 0.8.2.0-rc2 with the same version of snappy that 0.8.1.1 used. Attached the logs. Unfortunately there wasn't any improvement, as the node running 0.8.2.0-rc2 still had a higher load and CPU usage.
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Mathias
> > > > > > > > > > >
> > > > > > > > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <jai.forums2...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > > > > > > > > > > Jaikiran,
> > > > > > > > > > > > >
> > > > > > > > > > > > > The fix you provided is probably unnecessary. The channel that we use in SimpleConsumer (BlockingChannel) is configured to be blocking. So even though the read from the socket is in a loop, each read blocks if there are no bytes received from the broker. So, that shouldn't cause extra CPU consumption.
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > >
> > > > > > > > > > > > Of course, you are right! I forgot that while reading the thread dump in hprof output, one has to be aware that the thread state isn't shown and the thread need not necessarily be doing any CPU activity.
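[Editor's note: the point Jun makes about BlockingChannel can be illustrated with a small sketch. A thread that loops over a blocking read() is parked by the JVM/OS while no bytes are available, rather than spinning and burning CPU. This uses java.io pipes purely as a stand-in for the consumer's socket; it is not Kafka code.]

```java
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class BlockingReadSketch {
    public static void main(String[] args) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);

        // "Broker" side: delay before responding, so the reader below
        // spends that time blocked inside read(), not busy-waiting.
        Thread writer = new Thread(() -> {
            try {
                Thread.sleep(200);
                out.write("response".getBytes());
                out.close();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        // Reader side: read() in a loop, the pattern discussed above;
        // each call blocks until bytes arrive or the stream is closed.
        StringBuilder received = new StringBuilder();
        byte[] buf = new byte[64];
        int n;
        while ((n = in.read(buf)) != -1) {
            received.append(new String(buf, 0, n));
        }
        writer.join();
        System.out.println(received);  // prints "response"
    }
}
```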
> > > > > > > > > > > > -Jaikiran
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Jun
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <mathias.soederb...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Neha,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I sent an e-mail earlier today, but noticed now that it didn't actually go through.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Anyhow, I've attached two files, one with output from a 10 minute run and one with output from a 30 minute run. Realized that maybe I should've done one or two runs with 0.8.1.1 as well, but nevertheless.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same CPU usage as with the beta version (basically pegging all cores). If I manage to find the time I'll do another run with hprof on the rc2 version later today.
> > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > Mathias
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <n...@confluent.io> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The following should be sufficient:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > java -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof <classname>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You would need to start the Kafka server with the settings above for some time until you observe the problem.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <mathias.soederb...@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Neha,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yeah, sure. I'm not familiar with hprof, so are there any particular options I should include, or should I just run with defaults?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > > Mathias
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <n...@confluent.io> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for reporting the issue. Would you mind running hprof and sending the output?
> > > > > > > > > > > > > > > > > On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <mathias.soederb...@gmail.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Good day,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed that the CPU usage on the broker machines went up by roughly 40%, from ~60% to ~100%, and am wondering if anyone else has experienced something similar? The load average also went up by 2x-3x.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > We're running on EC2 and the cluster currently consists of four m1.xlarge, with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65 to be exact) and Scala 2.9.2. Configurations can be found over here: https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'm assuming that this is not expected behaviour for 0.8.2-beta?
> > > > > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > > > > Mathias
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > Neha
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Neha