First, let me apologize for not noticing this until today. One reason I left my last company was that I was not being paid to work on Kafka and could not afford any time to work on it for a while. In my new gig (just wrapped up my first week, woo hoo) I am still not "paid to work on Kafka", but I can afford more time for it now, and maybe in 6 months I will be able to hire folks to work on Kafka (with more and more time for myself to work on it too) while we also work on client projects (especially Kafka-based ones).
So, I understand the changes that were made to fix open file handles and make the random pinning time based (with a very large default time). Got all that.

But doesn't this completely negate what has been communicated to the community for a very long time, and the expectation they have? I think it does. The expected functionality for random partitioning is that "This can be done in a round-robin fashion simply to balance load" and that the "producer" does it for you.

Isn't a primary use case for partitions to parallelize consumers? If so, then the expectation would be that all consumers get the data that was produced for the topic in parallel and equally, in a "round robin fashion"... simply to balance load... with the producer handling it and the client application not having to do anything. Randomness that only occurs every 10 minutes can't balance load.

If users are going to work around this anyway (as I would honestly do too) by using a pseudo-semantic random key, essentially forcing real randomness simply to balance load across consumers running in parallel, would we still end up hitting the KAFKA-1017 problem anyway? If not, then why can't we just give users the functionality and put back the 3 lines of code: 1) if(key == null) 2) random.nextInt(numPartitions) 3) else ... If we would bump into KAFKA-1017 by working around it, then we have not really solved the root cause, and we are removing expected functionality for a corner case that might have other workarounds and/or code changes to solve it another way. Or am I still not getting something?

Also, I was looking at testRandomPartitioner in AsyncProducerTest and I don't see how it would ever fail. The assertion is always for partitionId == 0, and it should be checking that data is going to different partitions for a topic, right?
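To make the load-balancing concern concrete, here is a rough sketch in plain Java. It is not Kafka's producer code. The class and method names are made up for illustration, and the refresh interval is modeled as a message count rather than wall-clock time, which is an assumption. It compares a random choice per message with a random choice that is pinned until the next refresh.

```java
import java.util.Random;

// Hypothetical illustration only; none of these names come from Kafka.
class PartitionSketch {

    // Per-message random choice, i.e. what random.nextInt(numPartitions)
    // would do for every message that has a null key.
    static int[] perMessage(int numPartitions, int messages, Random random) {
        int[] counts = new int[numPartitions];
        for (int i = 0; i < messages; i++) {
            counts[random.nextInt(numPartitions)]++;
        }
        return counts;
    }

    // Pinned choice: pick a random partition and keep using it until
    // refreshEvery messages have passed (a stand-in for the refresh interval).
    static int[] pinned(int numPartitions, int messages, int refreshEvery, Random random) {
        int[] counts = new int[numPartitions];
        int current = random.nextInt(numPartitions);
        for (int i = 0; i < messages; i++) {
            if (i > 0 && i % refreshEvery == 0) {
                current = random.nextInt(numPartitions);
            }
            counts[current]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] spread = perMessage(4, 10_000, new Random());
        int[] stuck = pinned(4, 10_000, 100_000, new Random());
        System.out.println("per-message: " + java.util.Arrays.toString(spread));
        System.out.println("pinned:      " + java.util.Arrays.toString(stuck));
    }
}
```

When the refresh interval is longer than the burst being produced, the pinned variant puts every message on a single partition, so only one consumer in the group sees any data during that window.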
Let me know. I think this is an important discussion. Even if it ends with telling the community that one partition is all you need, and partitions become our super columns (Apache Cassandra joke, it's funny), then we manage and support that and that is just how it is. But if partitions are a good thing, and having multiple consumers scale in parallel for a single topic is also good, then we have to manage and support that.

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/