Sounds good; I will create a JIRA and upload a patch.
/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop
********************************************/

On Sep 17, 2013, at 1:19 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

> I agree that minimizing the number of producer connections (while being a good thing) is really only required in very large production deployments, and the net effect of the existing change is counter-intuitive to users who expect an immediate even distribution across _all_ partitions of the topic.
>
> However, I don't think it is a hack, because it is almost exactly the same behavior as 0.7 in one of its modes. The 0.7 producer (which I think was even more confusing) had three modes:
> i) ZK send
> ii) Config send(a): static list of broker1:port1,broker2:port2,etc.
> iii) Config send(b): static list of a hardwareVIP:VIPport
>
> (i) and (ii) would achieve even distribution. (iii) would effectively select one broker and distribute to partitions on that broker within each reconnect interval. (iii) is very similar to what we now do in 0.8. (Although we stick to one partition during each metadata refresh interval, that could be changed to stick to one broker and distribute across partitions on that broker.)
>
> At the same time, I agree with Joe's suggestion that we should keep the more intuitive pre-KAFKA-1017 behavior as the default and move the change in KAFKA-1017 to a more specific partitioner implementation.
>
> Joel
>
>
> On Sun, Sep 15, 2013 at 8:44 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
>> Let me ask another question which I think is more objective. Let's say 100 random, smart infrastructure specialists try Kafka; of these 100, how many do you believe will
>> 1. Say that this behavior is what they expected to happen?
>> 2. Be happy with this behavior?
>> I am not being facetious; I am genuinely looking for a numerical estimate. I am trying to figure out if nobody thought about this or if my estimate is just really different. For what it is worth, my estimates are 0 and 5 respectively.
>>
>> This would be fine except that we changed it from the good behavior to the bad behavior to fix an issue that probably only we have.
>>
>> -Jay
>>
>>
>> On Sun, Sep 15, 2013 at 8:37 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
>>
>>> I just took a look at this change. I agree with Joe; not to put too fine a point on it, but this is a confusing hack.
>>>
>>> Jun, I don't think wanting to minimize the number of TCP connections is going to be a very common need for people with fewer than 10k producers. I also don't think people are going to get very good load balancing out of this, because most people don't have a ton of producers. I think instead we will spend the next year explaining this behavior, which 99% of people will think is a bug (because it is crazy, non-intuitive, and breaks their usage).
>>>
>>> Why was this done by adding special default behavior in the null-key case instead of as a partitioner? The argument that the partitioner interface doesn't have sufficient information to choose a partition is not a good argument for hacking in changes to the default; it is an argument for *improving* the partitioner interface.
>>>
>>> The whole point of a partitioner interface is to make it possible to plug in non-standard behavior like this, right?
>>>
>>> -Jay
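For concreteness, here is a minimal Scala sketch of the kind of "improved" partitioner interface Jay is arguing for: one that is handed the partition list, including availability, rather than just a partition count. The trait and types below are simplified stand-ins for illustration, not the actual 0.8 API:

    import scala.util.Random

    // Simplified stand-in types (hypothetical, not the real 0.8 producer API).
    case class PartitionInfo(partitionId: Int, leaderAvailable: Boolean)

    // An enriched interface: the producer passes the full partition list so
    // the partitioner can see which partitions currently have a live leader.
    trait RichPartitioner[K] {
      def partition(key: K, partitions: Seq[PartitionInfo]): Int
    }

    // With that information visible, "random over available partitions"
    // becomes an ordinary pluggable partitioner rather than null-key logic
    // special-cased inside DefaultEventHandler.
    class AvailableRandomPartitioner[K] extends RichPartitioner[K] {
      private val rand = new Random
      def partition(key: K, partitions: Seq[PartitionInfo]): Int = {
        val available = partitions.filter(_.leaderAvailable)
        // Fall back to all partitions if none currently has a live leader.
        val candidates = if (available.nonEmpty) available else partitions
        candidates(rand.nextInt(candidates.size)).partitionId
      }
    }

This speaks directly to Jun's point below that the partitioner (unlike DefaultEventHandler) cannot see which partitions are available: passing that information through the interface is exactly the "improvement" being proposed.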
>>> On Sat, Sep 14, 2013 at 8:15 PM, Jun Rao <jun...@gmail.com> wrote:
>>>
>>>> Joe,
>>>>
>>>> Thanks for bringing this up. I want to clarify this a bit.
>>>>
>>>> 1. Currently, the producer-side logic is that if the partitioning key is not provided (i.e., it is null), the partitioner won't be called. We did that because we want to select a random and "available" partition to send messages to, so that if some partitions are temporarily unavailable (because of broker failures), messages can still be sent to other partitions. Doing this in the partitioner is difficult since the partitioner doesn't know which partitions are currently available (the DefaultEventHandler does).
>>>>
>>>> 2. As Joel said, the common use case in production is that there are many more producers than #partitions in a topic. In this case, sticking to a partition for a few minutes is not going to cause too much imbalance in the partitions, and it has the benefit of reducing the # of socket connections. My feeling is that this will benefit most production users. In fact, if one uses a hardware load balancer for producing data in 0.7, it behaves in exactly the same way (a producer will stick to a broker until the reconnect interval is reached).
>>>>
>>>> 3. It is true that if one is testing a topic with more than one partition (which is not the default value), this behavior can be a bit weird. However, I think it can be mitigated by running multiple test producer instances.
>>>>
>>>> 4. Someone reported on the mailing list that all data shows up in only one partition after a few weeks. This is clearly not the expected behavior. We can take a closer look to see if this is a real issue.
>>>>
>>>> Do you think these address your concerns?
>>>>
>>>> Thanks,
>>>>
>>>> Jun
>>>>
>>>>
>>>> On Sat, Sep 14, 2013 at 11:18 AM, Joe Stein <crypt...@gmail.com> wrote:
>>>>
>>>>> How about creating a new class called RandomRefreshPartitioner, copying the DefaultPartitioner code to it, and then reverting the DefaultPartitioner code? I appreciate that this is a one-time burden for folks using the existing 0.8-beta1 who bumped into KAFKA-1017 in production, since they would have to switch to the RandomRefreshPartitioner, and folks deploying to production would have to consider this property change.
>>>>>
>>>>> I make this suggestion keeping in mind the new folks that come on board with Kafka: this way, when everyone is in development and testing mode for the first time, their experience would match what they would expect of how it works in production. In dev/test, when first using Kafka, they won't have so many producers per partition, but they would look to parallelize their consumers, IMHO.
>>>>>
>>>>> The random-broker change sounds like maybe too big a change this late in the release cycle, if we can instead accommodate folks trying Kafka for the first time, through their development and testing, along with full-blown production deploys.
>>>>>
>>>>> /*******************************************
>>>>>  Joe Stein
>>>>>  Founder, Principal Consultant
>>>>>  Big Data Open Source Security LLC
>>>>>  http://www.stealth.ly
>>>>>  Twitter: @allthingshadoop
>>>>> ********************************************/
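A rough sketch of what Joe's proposed RandomRefreshPartitioner could look like: the KAFKA-1017 stickiness (one random partition per metadata refresh interval) preserved as an opt-in class, while the default reverts to per-message randomness. The interval and synchronization details here are assumptions for illustration, not code from any patch:

    import scala.util.Random

    // Hypothetical opt-in partitioner keeping the KAFKA-1017 behavior:
    // pick one random partition and stick to it for the refresh interval.
    class RandomRefreshPartitioner(refreshIntervalMs: Long = 10 * 60 * 1000L) {
      private val rand = new Random
      private var currentPartition = -1
      private var lastRefreshMs = 0L

      def partition(key: Any, numPartitions: Int): Int = synchronized {
        val now = System.currentTimeMillis()
        val expired = now - lastRefreshMs >= refreshIntervalMs
        if (currentPartition < 0 || expired || currentPartition >= numPartitions) {
          currentPartition = rand.nextInt(numPartitions)  // re-pick on refresh
          lastRefreshMs = now
        }
        currentPartition
      }
    }

Large deployments that care about connection count would configure this class explicitly; everyone else would get per-message random distribution by default.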
>>>>> On Sep 14, 2013, at 8:17 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for bringing this up - it is definitely an important point to discuss. The underlying issue of KAFKA-1017 was uncovered to some degree by the fact that in our deployment we did not significantly increase the total number of partitions over 0.7 - i.e., in 0.7 we had, say, four partitions per broker; now we are using (say) eight partitions across the cluster. So with random partitioning every producer would end up connecting to nearly every broker (unlike 0.7, in which we would connect to only one broker within each reconnect interval). In a production-scale deployment that causes the high number of connections that KAFKA-1017 addresses.
>>>>>>
>>>>>> You are right that the fix of sticking to one partition over the metadata refresh interval goes against true consumer parallelism, but this would be the case only if there are few producers. If you have a sizable number of producers, on average all partitions would get uniform volumes of data.
>>>>>>
>>>>>> One tweak to KAFKA-1017 that I think is reasonable would be, instead of sticking to a random partition, to stick to a random broker and send to random partitions within that broker. This would make the behavior closer to 0.7 wrt number of connections and random partitioning, provided the number of partitions per broker is high enough, which is why I mentioned the partition count (in our usage) in 0.7 vs 0.8 above. Thoughts?
>>>>>>
>>>>>> Joel
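A hypothetical sketch of Joel's tweak: pin to one random broker per refresh interval, but spread sends randomly across the partitions that broker leads. Connection count stays at one per producer, while distribution within the chosen broker stays random. The leader map passed in is a simplifying assumption (in the real producer this information would come from the topic metadata):

    import scala.util.Random

    class RandomBrokerPartitioner(refreshIntervalMs: Long = 10 * 60 * 1000L) {
      private val rand = new Random
      private var stickyBroker = -1
      private var lastRefreshMs = 0L

      // leaders(p) = id of the broker currently leading partition p
      def partition(leaders: IndexedSeq[Int]): Int = synchronized {
        val now = System.currentTimeMillis()
        if (stickyBroker < 0 || now - lastRefreshMs >= refreshIntervalMs ||
            !leaders.contains(stickyBroker)) {
          stickyBroker = leaders(rand.nextInt(leaders.size))  // re-pin to a broker
          lastRefreshMs = now
        }
        // Random partition among those led by the pinned broker.
        val onBroker = leaders.zipWithIndex.collect { case (b, p) if b == stickyBroker => p }
        onBroker(rand.nextInt(onBroker.size))
      }
    }

As Joel notes, this only approximates 0.7-style random partitioning when each broker leads enough partitions.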
>>>>>> On Friday, September 13, 2013, Joe Stein wrote:
>>>>>>
>>>>>>> First, let me apologize for not realizing/noticing this until today. One reason I left my last company was not being paid to work on Kafka nor being able to afford any time for a while to work on it. Now in my new gig (just wrapped up my first week, woo hoo), while I am still not "paid to work on Kafka," I can afford some more time for it now, and maybe in 6 months I will be able to hire folks to work on Kafka (with more and more time for myself to work on it too) while we also work on client projects (especially Kafka-based ones).
>>>>>>>
>>>>>>> So, I understand about the changes that were made to fix open file handles and make the random pinning be time-based (with a very large default time). Got all that.
>>>>>>>
>>>>>>> But doesn't this completely negate what has been communicated to the community for a very long time and the expectation they have? I think it does.
>>>>>>>
>>>>>>> The expected functionality for random partitioning is that "this can be done in a round-robin fashion simply to balance load" and that the "producer" does it for you.
>>>>>>>
>>>>>>> Isn't a primary use case for partitions to parallelize consumers? If so, then the expectation would be that all consumers would be getting, in parallel and equally in a "round-robin fashion," the data that was produced for the topic... simply to balance load... with the producer handling it and with the client application not having to do anything. This randomness occurring every 10 minutes can't balance load.
>>>>>>>
>>>>>>> If users are going to work around this anyway (as I would honestly do too) by generating a pseudo-random key and essentially forcing real randomness simply to balance load to their consumers running in parallel, would we still end up hitting the KAFKA-1017 problem anyway? If not, then why can't we just give users the functionality and put back the 3 lines of code: 1) if(key == null) 2) random.nextInt(numPartitions) 3) else ... If we would bump into KAFKA-1017 by working around it, then we have not really solved the root-cause problem, and we are removing expected functionality for a corner case that might have other workarounds and/or code changes to solve it another way. Or am I still not getting something?
>>>>>>>
>>>>>>> Also, I was looking at testRandomPartitioner in AsyncProducerTest, and I don't see how it would ever fail: the assertion is always that partitionId == 0, when it should be checking that data is going to different partitions for a topic, right?
>>>>>>>
>>>>>>> Let me know. I think this is an important discussion, and even if it ends up as telling the community that one partition is all you need and partitions become our super columns (Apache Cassandra joke, it's funny), then we manage and support it and that is just how it is; but if partitions are a good thing, and having multiple consumers scale in parallel for a single topic is also good, then we have to manage and support that.
>>>>>>>
>>>>>>> /*******************************************
>>>>>>>  Joe Stein
>>>>>>>  Founder, Principal Consultant
>>>>>>>  Big Data Open Source Security LLC
>>>>>>>  http://www.stealth.ly
>>>>>>>  Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
>>>>>>> ********************************************/
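For reference, a sketch of the pre-KAFKA-1017 default Joe describes, roughly the "3 lines" above: a fresh random partition per null-keyed message, hashing otherwise. The hashing branch mirrors a DefaultPartitioner-style implementation, and the bitmask is one way to keep the hash non-negative:

    import scala.util.Random

    object PreKafka1017Default {
      private val rand = new Random

      def getPartition(key: Any, numPartitions: Int): Int =
        if (key == null)
          rand.nextInt(numPartitions)                  // fresh random pick per message
        else
          (key.hashCode & 0x7fffffff) % numPartitions  // stable hash for keyed messages
    }

A test of this behavior would send many null-keyed messages and assert that more than one partition is hit, which is also the check Joe suggests testRandomPartitioner should be making instead of always asserting partitionId == 0.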