I just took a look at this change. I agree with Joe; not to put too fine a point on it, but this is a confusing hack.

Jun, I don't think wanting to minimize the number of TCP connections is going to be a very common need for people with fewer than 10k producers. I also don't think people are going to get very good load balancing out of this, because most people don't have a ton of producers. I think instead we will spend the next year explaining this behavior, which 99% of people will think is a bug (because it is crazy, non-intuitive, and breaks their usage).

Why was this done by adding special default behavior in the null-key case instead of as a partitioner? The argument that the partitioner interface doesn't have sufficient information to choose a partition is not a good argument for hacking changes into the default; it is an argument for *improving* the partitioner interface. The whole point of a partitioner interface is to make it possible to plug in non-standard behavior like this, right?

-Jay
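(To make the "improve the partitioner interface" idea concrete, here is a rough sketch of what an availability-aware partitioner hook might look like. The trait and every name in it are hypothetical illustrations, not the actual 0.8 Partitioner API; the point is only that the event handler could hand the plug-in the list of currently available partitions instead of special-casing null keys itself.)

    import java.util.Random

    // Hypothetical contract: the producer's event handler passes in the partitions
    // that currently have a live leader, so a plug-in can implement random, sticky,
    // or availability-aware policies without any null-key special case.
    trait AvailabilityAwarePartitioner {
      def partition(key: Any, availablePartitions: Seq[Int], numPartitions: Int): Int
    }

    // One possible default: hash real keys, pick a random *available* partition for
    // null keys, and fall back to any partition if none are known to be up.
    class RandomAvailablePartitioner extends AvailabilityAwarePartitioner {
      private val rand = new Random
      def partition(key: Any, availablePartitions: Seq[Int], numPartitions: Int): Int =
        if (key != null) math.abs(key.hashCode) % numPartitions
        else if (availablePartitions.nonEmpty) availablePartitions(rand.nextInt(availablePartitions.size))
        else rand.nextInt(numPartitions)
    }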
On Sat, Sep 14, 2013 at 8:15 PM, Jun Rao <jun...@gmail.com> wrote:

> Joe,
>
> Thanks for bringing this up. I want to clarify this a bit.
>
> 1. Currently, the producer-side logic is that if the partitioning key is not provided (i.e., it is null), the partitioner won't be called. We did that because we want to select a random and "available" partition to send messages to, so that if some partitions are temporarily unavailable (because of broker failures), messages can still be sent to other partitions. Doing this in the partitioner is difficult since the partitioner doesn't know which partitions are currently available (the DefaultEventHandler does).
>
> 2. As Joel said, the common use case in production is that there are many more producers than #partitions in a topic. In this case, sticking to a partition for a few minutes is not going to cause too much imbalance in the partitions, and it has the benefit of reducing the # of socket connections. My feeling is that this will benefit most production users. In fact, if one uses a hardware load balancer for producing data in 0.7, it behaves in exactly the same way (a producer will stick to a broker until the reconnect interval is reached).
>
> 3. It is true that if one is testing a topic with more than one partition (which is not the default), this behavior can be a bit weird. However, I think it can be mitigated by running multiple test producer instances.
>
> 4. Someone reported on the mailing list that all data shows up in only one partition after a few weeks. This is clearly not the expected behavior. We can take a closer look to see if this is a real issue.
>
> Do you think these address your concerns?
>
> Thanks,
>
> Jun
>
> On Sat, Sep 14, 2013 at 11:18 AM, Joe Stein <crypt...@gmail.com> wrote:
>
> > How about creating a new class called RandomRefreshPartitioner, copying the DefaultPartitioner code to it, and then reverting the DefaultPartitioner code? I appreciate that this is a one-time burden for folks using the existing 0.8-beta1 who are bumping into KAFKA-1017 in production, since they would have to switch to the RandomRefreshPartitioner, and folks deploying to production would have to consider this property change.
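(Purely as an illustration of that suggestion: a sketch of what an opt-in RandomRefreshPartitioner could look like, keeping the connection-saving "stick to one random partition until a refresh interval expires" behavior for the deployments that want it, while the default goes back to plain random/hashed partitioning. The field and parameter names are made up for this sketch; the actual KAFKA-1017 behavior lives in the producer's DefaultEventHandler rather than in a partitioner, which is exactly the difficulty Jun describes, so a class like this would only work if the producer actually called the partitioner for null keys.)

    import java.util.Random

    // Hypothetical opt-in partitioner: for null keys, pick one random partition and
    // keep using it until refreshMs has elapsed, then pick again. This keeps the
    // number of broker connections low at the cost of short-term imbalance.
    class RandomRefreshPartitioner(refreshMs: Long = 10 * 60 * 1000L) {
      private val rand = new Random
      private var cachedPartition = -1
      private var lastRefresh = 0L

      def partition(key: Any, numPartitions: Int): Int =
        if (key != null) math.abs(key.hashCode) % numPartitions
        else synchronized {
          val now = System.currentTimeMillis
          if (cachedPartition < 0 || now - lastRefresh > refreshMs) {
            cachedPartition = rand.nextInt(numPartitions)   // re-roll after the interval
            lastRefresh = now
          }
          cachedPartition
        }
    }

Something along these lines would be selected through the producer's partitioner.class property like any other custom partitioner, leaving the out-of-the-box default unchanged.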
> > I make this suggestion keeping in mind the new folks coming on board with Kafka: this way, when everyone is in development and testing mode for the first time, their experience matches how it would work in production. In dev/test, when first using Kafka, they won't have so many producers relative to partitions, but they would look to parallelize their consumers, IMHO.
> >
> > The random broker change sounds like maybe a bigger change to make this late in the release cycle, if we can otherwise accommodate both folks trying Kafka for the first time, through their development and testing, and full-blown production deploys.
> >
> > /*******************************************
> >  Joe Stein
> >  Founder, Principal Consultant
> >  Big Data Open Source Security LLC
> >  http://www.stealth.ly
> >  Twitter: @allthingshadoop
> > ********************************************/
> >
> > On Sep 14, 2013, at 8:17 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> >
> > > Thanks for bringing this up - it is definitely an important point to discuss. The underlying issue of KAFKA-1017 was uncovered to some degree by the fact that in our deployment we did not significantly increase the total number of partitions over 0.7 - i.e., in 0.7 we had say four partitions per broker, now we are using (say) eight partitions across the cluster. So with random partitioning every producer would end up connecting to nearly every broker (unlike 0.7, in which we would connect to only one broker within each reconnect interval). In a production-scale deployment that causes the high number of connections that KAFKA-1017 addresses.
> > >
> > > You are right that the fix of sticking to one partition over the metadata refresh interval goes against true consumer parallelism, but this would be the case only if there are few producers. If you have a sizable number of producers, on average all partitions would get uniform volumes of data.
> > >
> > > One tweak to KAFKA-1017 that I think is reasonable would be, instead of sticking to a random partition, to stick to a random broker and send to random partitions within that broker. This would make the behavior closer to 0.7 wrt number of connections, and closer to random partitioning provided the number of partitions per broker is high enough, which is why I mentioned the partition count (in our usage) in 0.7 vs 0.8 above. Thoughts?
> > >
> > > Joel
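(A rough sketch of the tweak Joel describes, just to make it concrete: pick one broker at random and stick to it for the refresh interval, but rotate randomly among that broker's own partitions. All names below, including the leader-lookup map, are hypothetical; this is not code from the actual patch.)

    import java.util.Random

    // Hypothetical helper: stick to one randomly chosen broker per refresh interval,
    // but spread sends randomly across the partitions that broker currently leads.
    class BrokerStickyChooser(refreshMs: Long = 10 * 60 * 1000L) {
      private val rand = new Random
      private var cachedBroker = -1
      private var lastRefresh = 0L

      // partitionsByLeader: brokerId -> partitions currently led by that broker
      // (in the real producer this would come from the cached topic metadata).
      def choosePartition(partitionsByLeader: Map[Int, Seq[Int]]): Int = synchronized {
        require(partitionsByLeader.nonEmpty, "no partitions with a live leader")
        val now = System.currentTimeMillis
        if (!partitionsByLeader.contains(cachedBroker) || now - lastRefresh > refreshMs) {
          val brokers = partitionsByLeader.keys.toIndexedSeq
          cachedBroker = brokers(rand.nextInt(brokers.size))   // re-pick the sticky broker
          lastRefresh = now
        }
        val partitions = partitionsByLeader(cachedBroker)
        partitions(rand.nextInt(partitions.size))              // random partition on that broker
      }
    }

With enough partitions per broker this keeps the per-producer connection count down while still spreading data across that broker's partitions, which is the trade-off Joel is pointing at.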
> > > On Friday, September 13, 2013, Joe Stein wrote:
> > >
> > > > First, let me apologize for not realizing/noticing this until today. One reason I left my last company was not being paid to work on Kafka nor being able to afford any time for a while to work on it. Now in my new gig (just wrapped up my first week, woo hoo), while I am still not "paid to work on Kafka", I can afford some more time for it now, and maybe in 6 months I will be able to hire folks to work on Kafka (with more and more time for myself to work on it too) while we also work on client projects (especially Kafka-based ones).
> > > >
> > > > So, I understand the changes that were made to fix open file handles and make the random pinning be time-based (with a very large default time). Got all that.
> > > >
> > > > But doesn't this completely negate what has been communicated to the community for a very long time, and the expectation they have? I think it does.
> > > >
> > > > The expected functionality for random partitioning is that "this can be done in a round-robin fashion simply to balance load" and that the "producer" does it for you.
> > > >
> > > > Isn't a primary use case for partitions to parallelize consumers? If so, then the expectation would be that all consumers would be getting, in parallel and equally, in a "round robin fashion", the data that was produced for the topic... simply to balance load... with the producer handling it and with the client application not having to do anything. This randomness occurring every 10 minutes can't balance load.
> > > >
> > > > If users are going to work around this anyway (as I would honestly do too), using a pseudo-random key and essentially forcing real randomness simply to balance load to my consumers running in parallel, would we still end up hitting the KAFKA-1017 problem anyway? If not, then why can't we just give users the functionality and put back the 3 lines of code: 1) if(key == null) 2) random.nextInt(numPartitions) 3) else ... If we would bump into KAFKA-1017 by working around it, then we have not really solved the root-cause problem, and we are removing expected functionality for a corner case that might have other workarounds and/or code changes to solve it another way - or am I still not getting something?
> > > >
> > > > Also, I was looking at testRandomPartitioner in AsyncProducerTest and I don't see how this would ever fail: the assertion is always for partitionId == 0, and it should be checking that data is going to different partitions for a topic, right?
> > > >
> > > > Let me know. I think this is an important discussion, and even if it ends up as telling the community to only use one partition (that is all you need, and partitions become our super columns - Apache Cassandra joke, it's funny), then we manage and support it and that is just how it is. But if partitions are a good thing, and having multiple consumers scale in parallel for a single topic is also good, then we have to manage and support that.
> > > >
> > > > /*******************************************
> > > >  Joe Stein
> > > >  Founder, Principal Consultant
> > > >  Big Data Open Source Security LLC
> > > >  http://www.stealth.ly
> > > >  Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
> > > > ********************************************/
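(For completeness, this is the kind of workaround being described: sending each message with a throwaway random key so the default hash partitioner spreads data across all partitions again. Sketched against the 0.8 Scala producer API as I understand it; the broker list and topic name are placeholders. Note that, as the question above points out, doing this re-opens connections to most brokers, which is exactly the situation KAFKA-1017 was trying to avoid.)

    import java.util.{Properties, UUID}
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    object RandomKeyWorkaround {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("metadata.broker.list", "broker1:9092")              // placeholder broker list
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        val producer = new Producer[String, String](new ProducerConfig(props))

        // Throwaway random key per message: the default partitioner hashes the key,
        // so every send can land on a different partition (and a different broker).
        for (i <- 1 to 100) {
          val key = UUID.randomUUID().toString
          producer.send(new KeyedMessage[String, String]("my-topic", key, "message-" + i))
        }
        producer.close()
      }
    }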