I'm checking into this on our side. The version we're working on jumping to right now is not the 0.8.2 release version, but it is significantly ahead of 0.8.1.1. We've got it deployed on one cluster and I'm making sure it's balanced right now before I take a look at all the metrics. I'll fill in more detail once I have it.
-Todd

On Thu, Feb 12, 2015 at 9:51 PM, Jay Kreps <jay.kr...@gmail.com> wrote:

> This is a serious issue, we'll take a look.
>
> -Jay
>
> On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon <so...@knewton.com> wrote:
>
> > I saw a very similar jump in CPU usage when I tried upgrading from 0.8.1.1
> > to 0.8.2.0 today in a test environment. The Kafka cluster there is two
> > m1.larges handling 2,000 partitions across 32 topics. CPU usage rose from
> > 40% into the 150%–190% range, and load average from under 1 to over 4.
> > Downgrading to 0.8.1.1 brought the CPU and load back to the previous
> > values.
> >
> > If there's more info that would be helpful, please let me know.
> >
> > On Thu, Feb 12, 2015 at 4:17 PM, Mathias Söderberg <
> > mathias.soederb...@gmail.com> wrote:
> >
> > > Jun,
> > >
> > > Pardon the radio silence. I booted up a new broker, created a topic with
> > > three (3) partitions and replication factor one (1) and used the
> > > *kafka-producer-perf-test.sh* script to generate load (using messages of
> > > roughly the same size as ours).
> > > There was a slight increase in CPU usage (~5-10%) on 0.8.2.0-rc2 compared
> > > to 0.8.1.1, but that was about it.
> > >
> > > I upgraded our staging cluster to 0.8.2.0 earlier this week or so, and had
> > > to add an additional broker due to increased load after the upgrade (note
> > > that the incoming load on the cluster has been pretty much consistent).
> > > Since the upgrade we've been seeing a 2-3x increase in latency as well.
> > > I'm considering downgrading to 0.8.1.1 again to see if it resolves our
> > > issues.
> > >
> > > Best regards,
> > > Mathias
> > >
> > > On Tue Feb 03 2015 at 6:44:36 PM Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Mathias,
> > > >
> > > > The new hprof doesn't reveal anything new to me. We did fix the logic in
> > > > using Purgatory in 0.8.2, which could potentially drive up the CPU usage a
> > > > bit. To verify that, could you do your test on a single broker (with
> > > > replication factor 1) between 0.8.1 and 0.8.2 and see if there is any
> > > > significant difference in CPU usage?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Feb 3, 2015 at 5:09 AM, Mathias Söderberg <
> > > > mathias.soederb...@gmail.com> wrote:
> > > >
> > > > > Jun,
> > > > >
> > > > > I re-ran the hprof test, for about 30 minutes again, for 0.8.2.0-rc2 with
> > > > > the same version of snappy that 0.8.1.1 used. Attached the logs.
> > > > > Unfortunately there wasn't any improvement as the node running 0.8.2.0-rc2
> > > > > still had a higher load and CPU usage.
> > > > >
> > > > > Best regards,
> > > > > Mathias
> > > > >
> > > > > On Tue Feb 03 2015 at 4:40:31 AM Jaikiran Pai <jai.forums2...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
> > > > >> > Jaikiran,
> > > > >> >
> > > > >> > The fix you provided is probably unnecessary. The channel that we use in
> > > > >> > SimpleConsumer (BlockingChannel) is configured to be blocking. So even
> > > > >> > though the read from the socket is in a loop, each read blocks if there
> > > > >> > are no bytes received from the broker. So, that shouldn't cause extra CPU
> > > > >> > consumption.
> > > > >> Hi Jun,
> > > > >>
> > > > >> Of course, you are right! I forgot that while reading the thread dump in
> > > > >> hprof output, one has to be aware that the thread state isn't shown and
> > > > >> the thread need not necessarily be doing any CPU activity.
> > > > >>
> > > > >> -Jaikiran
> > > > >>
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> > Jun
> > > > >> >
> > > > >> > On Mon, Jan 26, 2015 at 10:05 AM, Mathias Söderberg <
> > > > >> > mathias.soederb...@gmail.com> wrote:
> > > > >> >
> > > > >> >> Hi Neha,
> > > > >> >>
> > > > >> >> I sent an e-mail earlier today, but noticed now that it didn't actually go
> > > > >> >> through.
> > > > >> >>
> > > > >> >> Anyhow, I've attached two files, one with output from a 10 minute run and
> > > > >> >> one with output from a 30 minute run. Realized that maybe I should've done
> > > > >> >> one or two runs with 0.8.1.1 as well, but nevertheless.
> > > > >> >>
> > > > >> >> I upgraded our staging cluster to 0.8.2.0-rc2, and I'm seeing the same CPU
> > > > >> >> usage as with the beta version (basically pegging all cores). If I manage
> > > > >> >> to find the time I'll do another run with hprof on the rc2 version later
> > > > >> >> today.
> > > > >> >>
> > > > >> >> Best regards,
> > > > >> >> Mathias
> > > > >> >>
> > > > >> >> On Tue Dec 09 2014 at 10:08:21 PM Neha Narkhede <n...@confluent.io> wrote:
> > > > >> >>
> > > > >> >>> The following should be sufficient
> > > > >> >>>
> > > > >> >>> java -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof <classname>
> > > > >> >>>
> > > > >> >>> You would need to start the Kafka server with the settings above for
> > > > >> >>> some time until you observe the problem.
> > > > >> >>>
> > > > >> >>> On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg <
> > > > >> >>> mathias.soederb...@gmail.com> wrote:
> > > > >> >>>
> > > > >> >>>> Hi Neha,
> > > > >> >>>>
> > > > >> >>>> Yeah sure. I'm not familiar with hprof, so any particular options I should
> > > > >> >>>> include or just run with defaults?
> > > > >> >>>>
> > > > >> >>>> Best regards,
> > > > >> >>>> Mathias
> > > > >> >>>>
> > > > >> >>>> On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede <n...@confluent.io> wrote:
> > > > >> >>>>> Thanks for reporting the issue. Would you mind running hprof and sending
> > > > >> >>>>> the output?
> > > > >> >>>>>
> > > > >> >>>>> On Mon, Dec 8, 2014 at 1:25 AM, Mathias Söderberg <
> > > > >> >>>>> mathias.soederb...@gmail.com> wrote:
> > > > >> >>>>>
> > > > >> >>>>>> Good day,
> > > > >> >>>>>>
> > > > >> >>>>>> I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed that
> > > > >> >>>>>> the CPU usage on the broker machines went up by roughly 40%, from ~60% to
> > > > >> >>>>>> ~100% and am wondering if anyone else has experienced something similar?
> > > > >> >>>>>> The load average also went up by 2x-3x.
> > > > >> >>>>>>
> > > > >> >>>>>> We're running on EC2 and the cluster currently consists of four m1.xlarge,
> > > > >> >>>>>> with roughly 1100 topics / 4000 partitions. Using Java 7 (1.7.0_65 to be
> > > > >> >>>>>> exact) and Scala 2.9.2. Configurations can be found over here:
> > > > >> >>>>>> https://gist.github.com/mthssdrbrg/7df34a795e07eef10262.
> > > > >> >>>>>>
> > > > >> >>>>>> I'm assuming that this is not expected behaviour for 0.8.2-beta?
> > > > >> >>>>>>
> > > > >> >>>>>> Best regards,
> > > > >> >>>>>> Mathias
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>> --
> > > > >> >>>>> Thanks,
> > > > >> >>>>> Neha
> > > > >> >>>>>
> > > > >> >>>
> > > > >> >>> --
> > > > >> >>> Thanks,
> > > > >> >>> Neha
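
For anyone reading the thread later: Jun's point about the blocking channel (a read loop on a blocking socket does not burn CPU while no bytes arrive) can be checked with a small experiment. The following is only a sketch of the general OS behaviour, written in Python rather than Kafka's Scala, and it does not use Kafka's BlockingChannel or SimpleConsumer at all. The function name `measure_blocked_read` and its parameters are made up for this illustration.

```python
import socket
import threading
import time


def measure_blocked_read(idle_seconds=0.5, payload=b"hello"):
    """Hypothetical helper: return (cpu_seconds, wall_seconds, received) for a
    reader thread that loops on a blocking socket while the peer stays silent."""
    reader_sock, writer_sock = socket.socketpair()  # sockets are blocking by default
    result = {}

    def read_loop():
        cpu_start = time.thread_time()    # CPU time consumed by this thread only
        wall_start = time.monotonic()
        chunks = []
        while True:
            data = reader_sock.recv(4096)  # thread is parked here until bytes arrive
            if not data:                   # empty result means the peer closed
                break
            chunks.append(data)
        result["cpu"] = time.thread_time() - cpu_start
        result["wall"] = time.monotonic() - wall_start
        result["received"] = b"".join(chunks)

    reader = threading.Thread(target=read_loop)
    reader.start()
    time.sleep(idle_seconds)     # the "broker" sends nothing for a while
    writer_sock.sendall(payload)
    writer_sock.close()          # lets the read loop terminate
    reader.join()
    reader_sock.close()
    return result["cpu"], result["wall"], result["received"]


if __name__ == "__main__":
    cpu, wall, received = measure_blocked_read()
    print(f"wall={wall:.3f}s cpu={cpu:.6f}s received={received!r}")
```

The reader thread spends nearly all of its wall-clock time inside `recv` while its own CPU time stays close to zero, which matches Jaikiran's follow-up that a thread showing up in a dump is not necessarily doing any CPU work.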