So I’ve been looking at the SSL performance (which potentially also has an issue unrelated to this). I’d noticed strange behaviour with poll() on the new consumer, but I wasn’t sure whether this was a bug or a feature. On closer inspection it seems to arise from ConsumerPerformance relying on a call to consumer.poll(100). Not all our tests do it this way, I should add.
If you change this to poll(1) you should see reasonable performance ensue (for 1KB messages I see throughput locally jump from ~10MB/s to ~300MB/s, which matches the old consumer). Something like replacing:

    val records = consumer.poll(100)

with:

    var records = consumer.poll(0)
    while (records.isEmpty)
      records = consumer.poll(1)

The problem appears to be that the client returns empty results even when there are messages waiting to be read. Each empty result then costs a sleep of the specified timeout, and overall performance takes a hit. I’ll dig a little deeper, but for now there appears to be a workaround, so that’s something :)

B

> On 28 Aug 2015, at 07:30, Ewen Cheslack-Postava <e...@confluent.io> wrote:
>
> Tried bisecting, but it turns out things were broken for some time. We really
> need some system tests in place to avoid letting even new code break for so
> long.
>
> At 49026f11781181c38e9d5edb634be9d27245c961 (May 14th), we went from good
> performance to an error due to the broker apparently not accepting the
> partition assignment strategy. Since this commit seems to add heartbeats
> and the server-side code for partition assignment strategies, I assume we
> were missing something on the client side, and by filling in the server
> side, things stopped working.
>
> On either 84636272422b6379d57d4c5ef68b156edc1c67f8 or
> a5b11886df8c7aad0548efd2c7c3dbc579232f03 (July 17th), I am able to run the
> perf test again, but it's slow -- ~10MB/s for me vs the 2MB/s Jay was
> seeing, but that's still far less than the 600MB/s I saw on the earlier
> commits.
>
> Added this to the new consumer checklist, marked for 0.8.3, and at least
> for now assigned to Jason since I think he'll probably be able to sort this
> out most quickly: https://issues.apache.org/jira/browse/KAFKA-2486
>
> -Ewen
>
>
> On Thu, Aug 27, 2015 at 8:03 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>
>> 436b7ddc386eb688ba0f12836710f5e4bcaa06c8 is pretty recent, and there could
>> be some recent consumer improvement patches that introduce some
>> regression. I would suggest doing a binary search in the log from
>> 3f8480ccfb011eb43da774737597c597f703e11b (maybe even earlier?) to do a
>> quick check.
>>
>> Guozhang
>>
>> On Thu, Aug 27, 2015 at 4:39 PM, Jay Kreps <j...@confluent.io> wrote:
>>
>>> I think this is likely a regression. The two clients had more or less
>>> equivalent performance when we checked in the code (see my post on this
>>> earlier in the year). Looks like maybe we broke something in the
>>> interim?
>>>
>>> On my laptop the new consumer perf seems to have dropped from about
>>> ~200MB/sec to about 2MB/sec.
>>>
>>> -Jay
>>>
>>>
>>> On Thu, Aug 27, 2015 at 4:21 PM, Ewen Cheslack-Postava <e...@confluent.io>
>>> wrote:
>>>
>>>> I don't think the commands are really equivalent despite just adding the
>>>> --new-consumer flag. ConsumerPerformance uses a single thread when using
>>>> the new consumer (it literally just allocates the consumer, loops until
>>>> it's consumed enough, then exits), whereas the old consumer uses a bunch
>>>> of additional threads.
>>>>
>>>> To really compare performance, someone would have to think through a fair
>>>> way to compare them -- the two operate so differently that you'd have to
>>>> be very careful to get an apples-to-apples comparison.
>>>>
>>>> By the way, membership in consumer groups should be a lot cheaper with
>>>> the new consumer (the ZK coordination issues with lots of consumers
>>>> aren't a problem since ZK is not involved), so you can probably scale up
>>>> the number of consumer threads with little impact. It might be nice to
>>>> patch the consumer perf test to respect the # of threads setting, which
>>>> might be a first step to getting a more reasonable comparison.
>>>>
>>>> -Ewen
>>>>
>>>> On Thu, Aug 27, 2015 at 11:25 AM, Poorna Chandra Tejashvi Reddy <
>>>> pctre...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We have built the latest Kafka from https://github.com/apache/kafka
>>>>> based on commit id 436b7ddc386eb688ba0f12836710f5e4bcaa06c8.
>>>>> We ran the performance test on a 3-node Kafka cluster. There is a huge
>>>>> throughput degradation using the new consumer compared to the regular
>>>>> consumer. Below are the numbers that show this:
>>>>>
>>>>> bin/kafka-consumer-perf-test.sh --zookeeper zkIp:2181 --broker-list
>>>>> brokerIp:9092 --topics test --messages 5000000 : gives a throughput of
>>>>> 693K
>>>>>
>>>>> bin/kafka-consumer-perf-test.sh --zookeeper zkIp:2181 --broker-list
>>>>> brokerIp:9092 --topics test --messages 5000000 --new-consumer : gives a
>>>>> throughput of 51K
>>>>>
>>>>> The whole setup is based on EC2, with the Kafka brokers running on
>>>>> r3.2xlarge instances.
>>>>>
>>>>> Are you guys aware of this performance degradation, and do you have a
>>>>> JIRA for this which can be used to track the resolution?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Poorna
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Ewen
>>>
>>
>>
>> --
>> -- Guozhang
>
>
>
> --
> Thanks,
> Ewen
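[Editor's sketch] The cost of that 100 ms sleep on an empty result can be made concrete without a broker. The toy model below (plain Java, no Kafka dependency) is not the actual client: the class name, the 90% empty-poll rate, and the 64-record batch size are all invented for illustration. It just assumes poll() blocks for the full timeout whenever it comes back empty, and compares the single poll(100) pattern against the poll(0)-then-poll(1) retry loop described above.

```java
import java.util.Random;

public class PollTimeoutModel {
    // Invented parameters: a "fetch" yields 64 records, and 90% of polls
    // come back empty (after blocking for the full timeout).
    static final int BATCH_RECORDS = 64;
    static final double EMPTY_RATE = 0.9;

    // Toy stand-in for KafkaConsumer.poll(timeout): sleeps the whole
    // timeout on an empty result, otherwise returns a batch immediately.
    static int poll(long timeoutMs, Random rng) {
        if (rng.nextDouble() < EMPTY_RATE) {
            sleep(timeoutMs);
            return 0;
        }
        return BATCH_RECORDS;
    }

    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    // Records consumed in wallMs using one poll(100) per loop iteration,
    // as ConsumerPerformance effectively does.
    static long naive(long wallMs, Random rng) {
        long end = System.nanoTime() + wallMs * 1_000_000L, n = 0;
        while (System.nanoTime() < end)
            n += poll(100, rng);
        return n;
    }

    // Records consumed using poll(0), then poll(1) until non-empty:
    // an empty result now wastes at most ~1 ms instead of 100 ms.
    static long retry(long wallMs, Random rng) {
        long end = System.nanoTime() + wallMs * 1_000_000L, n = 0;
        while (System.nanoTime() < end) {
            int got = poll(0, rng);
            while (got == 0 && System.nanoTime() < end)
                got = poll(1, rng);
            n += got;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("poll(100) pattern:      "
                + naive(500, new Random(42)) + " records in 500 ms");
        System.out.println("poll(0)/poll(1) loop:   "
                + retry(500, new Random(42)) + " records in 500 ms");
    }
}
```

Under these assumptions the retry loop comes out well over an order of magnitude ahead, which is consistent in spirit with the ~10MB/s vs ~300MB/s numbers reported above.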