I would be in favor of that. I agree this is better than 0.7. -Jay
On Tue, Sep 17, 2013 at 10:19 AM, Joel Koshy <[email protected]> wrote:

> I agree that minimizing the number of producer connections (while being a good thing) is really required only in very large production deployments, and the net effect of the existing change is counter-intuitive to users who expect an immediate even distribution across _all_ partitions of the topic.
>
> However, I don't think it is a hack, because it is almost exactly the same behavior as 0.7 in one of its modes. The 0.7 producer (which I think was even more confusing) had three modes:
> i) ZK send
> ii) Config send (a): static list of broker1:port1,broker2:port2,etc.
> iii) Config send (b): static list of a hardwareVIP:VIPport
>
> (i) and (ii) would achieve even distribution. (iii) would effectively select one broker and distribute to partitions on that broker within each reconnect interval. (iii) is very similar to what we now do in 0.8. (Although we stick to one partition during each metadata refresh interval, that could be changed to stick to one broker and distribute across partitions on that broker.)
>
> At the same time, I agree with Joe's suggestion that we should keep the more intuitive pre-KAFKA-1017 behavior as the default and move the change in KAFKA-1017 to a more specific partitioner implementation.
>
> Joel
>
> On Sun, Sep 15, 2013 at 8:44 AM, Jay Kreps <[email protected]> wrote:
>
>> Let me ask another question which I think is more objective. Let's say 100 random, smart infrastructure specialists try Kafka. Of these 100, how many do you believe will
>> 1. say that this behavior is what they expected to happen?
>> 2. be happy with this behavior?
>> I am not being facetious; I am genuinely looking for a numerical estimate. I am trying to figure out if nobody thought about this or if my estimate is just really different. For what it is worth, my estimates are 0 and 5 respectively.
>>
>> This would be fine except that we changed it from the good behavior to the bad behavior to fix an issue that probably only we have.
>>
>> -Jay
>>
>> On Sun, Sep 15, 2013 at 8:37 AM, Jay Kreps <[email protected]> wrote:
>>
>>> I just took a look at this change. I agree with Joe; not to put too fine a point on it, but this is a confusing hack.
>>>
>>> Jun, I don't think wanting to minimize the number of TCP connections is going to be a very common need for people with fewer than 10k producers. I also don't think people are going to get very good load balancing out of this, because most people don't have a ton of producers. I think instead we will spend the next year explaining this behavior, which 99% of people will think is a bug (because it is crazy, non-intuitive, and breaks their usage).
>>>
>>> Why was this done by adding special default behavior in the null key case instead of as a partitioner? The argument that the partitioner interface doesn't have sufficient information to choose a partition is not a good argument for hacking in changes to the default; it is an argument for *improving* the partitioner interface.
>>>
>>> The whole point of a partitioner interface is to make it possible to plug in non-standard behavior like this, right?
>>>
>>> -Jay
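
For illustration, a minimal sketch of what the time-based "sticky" choice could look like if it were expressed through a pluggable partitioner, as Jay suggests, rather than as special null-key handling. This is a sketch only: the trait below is a stand-in for the producer's Partitioner interface, and names such as StickyRandomPartitioner and stickyIntervalMs are invented here.

    import java.util.Random

    // Stand-in for the producer's pluggable partitioner interface (illustrative).
    trait SimplePartitioner {
      def partition(key: Any, numPartitions: Int): Int
    }

    // Picks a random partition and sticks to it until the interval elapses,
    // i.e. the disputed behavior, but as an opt-in partitioner implementation.
    class StickyRandomPartitioner(stickyIntervalMs: Long = 10 * 60 * 1000L)
        extends SimplePartitioner {
      private val rand = new Random
      private var current = -1
      private var lastPickMs = 0L

      override def partition(key: Any, numPartitions: Int): Int = this.synchronized {
        val now = System.currentTimeMillis()
        // Re-roll the partition only when the sticky interval has elapsed
        // or the previous choice is no longer a valid partition id.
        if (current < 0 || current >= numPartitions || now - lastPickMs > stickyIntervalMs) {
          current = rand.nextInt(numPartitions)
          lastPickMs = now
        }
        current
      }
    }
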
>>>
>>> On Sat, Sep 14, 2013 at 8:15 PM, Jun Rao <[email protected]> wrote:
>>>
>>>> Joe,
>>>>
>>>> Thanks for bringing this up. I want to clarify this a bit.
>>>>
>>>> 1. Currently, the producer-side logic is that if the partitioning key is not provided (i.e., it is null), the partitioner won't be called. We did that because we want to select a random and "available" partition to send messages to, so that if some partitions are temporarily unavailable (because of broker failures), messages can still be sent to other partitions. Doing this in the partitioner is difficult since the partitioner doesn't know which partitions are currently available (the DefaultEventHandler does).
>>>>
>>>> 2. As Joel said, the common use case in production is that there are many more producers than #partitions in a topic. In this case, sticking to a partition for a few minutes is not going to cause too much imbalance across the partitions and has the benefit of reducing the # of socket connections. My feeling is that this will benefit most production users. In fact, if one uses a hardware load balancer for producing data in 0.7, it behaves in exactly the same way (a producer will stick to a broker until the reconnect interval is reached).
>>>>
>>>> 3. It is true that if one is testing a topic with more than one partition (which is not the default value), this behavior can be a bit weird. However, I think it can be mitigated by running multiple test producer instances.
>>>>
>>>> 4. Someone reported on the mailing list that all data shows up in only one partition after a few weeks. This is clearly not the expected behavior. We can take a closer look to see if this is a real issue.
>>>>
>>>> Do you think these address your concerns?
>>>>
>>>> Thanks,
>>>>
>>>> Jun
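
As a rough illustration of what Jun describes in point 1 (a sketch, not the actual DefaultEventHandler code): with a null key the partitioner is bypassed and a random partition is picked among those that currently have a leader, information the event handler has but the partitioner does not. The types and names below are invented.

    import scala.util.Random

    // Illustrative type; the real partition metadata objects differ.
    case class PartitionInfo(id: Int, leaderAvailable: Boolean)

    def selectPartition(key: Any,
                        partitions: Seq[PartitionInfo],
                        partition: (Any, Int) => Int): Int = {
      if (key == null) {
        // Null key: skip the partitioner and pick randomly among live partitions,
        // so sends keep flowing around temporarily failed brokers.
        val available = partitions.filter(_.leaderAvailable)
        val candidates = if (available.nonEmpty) available else partitions
        candidates(Random.nextInt(candidates.size)).id
      } else {
        // Keyed sends still go through the (pluggable) partitioner.
        partition(key, partitions.size)
      }
    }
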
>>>>
>>>> On Sat, Sep 14, 2013 at 11:18 AM, Joe Stein <[email protected]> wrote:
>>>>
>>>>> How about creating a new class called RandomRefreshPartitioner, copying the DefaultPartitioner code to it, and then reverting the DefaultPartitioner code? I appreciate this is a one-time burden for folks using the existing 0.8-beta1 who bumped into KAFKA-1017 in production, since they would have to switch to the RandomRefreshPartitioner, and when folks deploy to production they will have to consider this property change.
>>>>>
>>>>> I make this suggestion keeping in mind the new folks that come on board with Kafka: when everyone is in development and testing mode for the first time, their experience would match how it would work in production this way. In dev/test, when first using Kafka, they won't have so many producers per partition but would look to parallelize their consumers, IMHO.
>>>>>
>>>>> The random-broker change sounds like maybe a bigger change now, this late in the release cycle, if we can accommodate folks trying Kafka for the first time and through their development and testing along with full-blown production deploys.
>>>>>
>>>>> /*******************************************
>>>>>  Joe Stein
>>>>>  Founder, Principal Consultant
>>>>>  Big Data Open Source Security LLC
>>>>>  http://www.stealth.ly
>>>>>  Twitter: @allthingshadoop
>>>>> ********************************************/
>>>>>
>>>>> On Sep 14, 2013, at 8:17 AM, Joel Koshy <[email protected]> wrote:
>>>>>
>>>>>> Thanks for bringing this up - it is definitely an important point to discuss. The underlying issue of KAFKA-1017 was uncovered to some degree by the fact that in our deployment we did not significantly increase the total number of partitions over 0.7 - i.e., in 0.7 we had (say) four partitions per broker, and now we are using (say) eight partitions across the cluster. So with random partitioning every producer would end up connecting to nearly every broker (unlike 0.7, in which we would connect to only one broker within each reconnect interval). In a production-scale deployment that causes the high number of connections that KAFKA-1017 addresses.
>>>>>>
>>>>>> You are right that the fix of sticking to one partition over the metadata refresh interval goes against true consumer parallelism, but this would be the case only if there are few producers. If you have a sizable number of producers, on average all partitions would get uniform volumes of data.
>>>>>>
>>>>>> One tweak to KAFKA-1017 that I think is reasonable would be, instead of sticking to a random partition, stick to a random broker and send to random partitions within that broker. This would make the behavior closer to 0.7 with respect to the number of connections and random partitioning, provided the number of partitions per broker is high enough, which is why I mentioned the partition count (in our usage) in 0.7 vs 0.8 above. Thoughts?
>>>>>>
>>>>>> Joel
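
A small sketch of the tweak Joel floats here, purely illustrative (the types and names are invented): stick to one randomly chosen broker per refresh interval, but spread sends across the partitions that broker leads, so the connection count stays low while partition balance on that broker improves.

    import scala.util.Random

    // Illustrative type; the real leader metadata differs.
    case class PartitionLeader(partitionId: Int, brokerId: Int)

    class BrokerStickyChooser(refreshIntervalMs: Long = 10 * 60 * 1000L) {
      private var stickyBroker = -1
      private var lastPickMs = 0L

      def choose(partitions: Seq[PartitionLeader]): Int = this.synchronized {
        require(partitions.nonEmpty, "no partitions to choose from")
        val now = System.currentTimeMillis()
        val brokers = partitions.map(_.brokerId).distinct
        // Re-pick the sticky broker when the interval elapses or it disappears
        // from the current metadata.
        if (!brokers.contains(stickyBroker) || now - lastPickMs > refreshIntervalMs) {
          stickyBroker = brokers(Random.nextInt(brokers.size))
          lastPickMs = now
        }
        // Random among the partitions this broker leads: one connection's worth
        // of brokers, but still several partitions to spread over.
        val local = partitions.filter(_.brokerId == stickyBroker)
        local(Random.nextInt(local.size)).partitionId
      }
    }
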
>>>>>>
>>>>>> On Friday, September 13, 2013, Joe Stein wrote:
>>>>>>
>>>>>>> First, let me apologize for not realizing/noticing this until today. One reason I left my last company was not being paid to work on Kafka nor being able to afford any time for a while to work on it. Now, in my new gig (just wrapped up my first week, woo hoo), while I am still not "paid to work on Kafka" I can afford some more time for it, and maybe in 6 months I will be able to hire folks to work on Kafka (with more and more time for myself to work on it too) while we also work on client projects (especially Kafka-based ones).
>>>>>>>
>>>>>>> So, I understand about the changes that were made to fix open file handles and make the random pinning be time-based (with a very large default time). Got all that.
>>>>>>>
>>>>>>> But doesn't this completely negate what has been communicated to the community for a very long time and the expectation they have? I think it does.
>>>>>>>
>>>>>>> The expected functionality for random partitioning is that "This can be done in a round-robin fashion simply to balance load" and that the "producer" does it for you.
>>>>>>>
>>>>>>> Isn't a primary use case for partitions to parallelize consumers? If so, then the expectation would be that all consumers would be getting, in parallel and equally, in a "round robin fashion", the data that was produced for the topic... simply to balance load... with the producer handling it and with the client application not having to do anything. This randomness occurring every 10 minutes can't balance load.
>>>>>>>
>>>>>>> If users are going to work around this anyway (as I would honestly do too) by using a pseudo-semantic random key and essentially forcing real randomness to simply balance load to my consumers running in parallel, would we still end up hitting the KAFKA-1017 problem anyway? If not, then why can't we just give users the functionality and put back the 3 lines of code: 1) if(key == null) 2) random.nextInt(numPartitions) 3) else ... If we would bump into KAFKA-1017 by working around it, then we have not really solved the root-cause problem, and we are removing expected functionality for a corner case that might have other workarounds and/or code changes to solve it another way. Or am I still not getting something?
>>>>>>>
>>>>>>> Also, I was looking at testRandomPartitioner in AsyncProducerTest and I don't see how this would ever fail; the assertion is always for partitionId == 0, and it should be checking that data is going to different partitions for a topic, right?
>>>>>>>
>>>>>>> Let me know. I think this is an important discussion, and even if it ends up as telling the community that one partition is all you need and partitions become our super columns (Apache Cassandra joke, it's funny), then we manage and support it and that is just how it is. But if partitions are a good thing, and having multiple consumers scale in parallel for a single topic is also good, then we have to manage and support that.
>>>>>>>
>>>>>>> /*******************************************
>>>>>>>  Joe Stein
>>>>>>>  Founder, Principal Consultant
>>>>>>>  Big Data Open Source Security LLC
>>>>>>>  http://www.stealth.ly
>>>>>>>  Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
>>>>>>> ********************************************/
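
On the AsyncProducerTest point above, the shape of the check Joe seems to be asking for would be roughly the following. This is a hypothetical snippet, not the actual test: assert that repeated null-key sends spread over more than one partition, rather than asserting partitionId == 0.

    import scala.util.Random

    // Hypothetical check: with random partitioning, repeated null-key sends
    // should not all land on the same partition.
    val numPartitions = 4
    val hits = (1 to 1000).map(_ => Random.nextInt(numPartitions)).toSet
    assert(hits.size > 1, "expected null-key sends to reach more than one partition")
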
