Jira ticket: https://issues.apache.org/jira/browse/KAFKA-687
2013/1/7 Pablo Barrera González <pablo.barr...@gmail.com>:

Thank you Jun and Neha.

I was trying to avoid adding more partitions. I have enough partitions if you count all partitions in all topics. I understand the problem with different data load per topic, but the current scheme does not solve this problem either, so we wouldn't be worse off if we considered all partitions from all topics at the same time.

I will open the JIRA ticket to track this.

Thanks again for the clarification.

Cheers

Pablo


2013/1/7 Neha Narkhede <neha.narkh...@gmail.com>:

Pablo,

That is a good suggestion. Ideally, the partitions across all topics should be distributed evenly across consumer streams instead of on a per-topic basis. There is no particular advantage to the current scheme of per-topic rebalancing that I can think of. Would you mind filing a JIRA to track this improvement?

Thanks,
Neha


On Mon, Jan 7, 2013 at 9:10 AM, Jun Rao <jun...@gmail.com> wrote:

Pablo,

Currently, a partition is the smallest unit by which we distribute data among consumers (in the same consumer group). So, if the number of consumers is larger than the total number of partitions in a Kafka cluster (across all brokers), some consumers will never get any data. This decision is made on a per-topic basis. If a consumer consumes multiple topics, it would make sense to divide the partitions across all topics among the consumers. We haven't done that yet. Part of the reason is that we need to figure out how to balance the data across topics, since they can be of different sizes. We can look into that post 0.8.

For now, the solution is to increase the number of partitions on the broker.

Thanks,

Jun


On Mon, Jan 7, 2013 at 9:03 AM, Pablo Barrera González <pablo.barr...@gmail.com> wrote:

Hello

We are starting to use Kafka in production, but we found an unexpected (at least for me) behavior in the use of partitions. We have a number of topics with a few partitions each, and we try to consume all the data from several consumers (in a single consumer group).

The problem is in the rebalance step. The rebalance splits the partitions per topic between all consumers. So if you have 100 topics with only 2 partitions each and 10 consumers, only two consumers will be used. That is, for each topic, all partitions are listed and shared among the consumers in the consumer group in order (not randomly).

This behavior is also described in Algorithm 1 of the original Kafka paper [1].

I don't understand this decision. Why split by topic? Does it make sense to divide all partitions from all topics between all the consumers in the consumer group? I don't see the reason for this, so I would like to hear your opinion before changing the code.

We are using Kafka 0.7.1.

Thank you in advance

Pablo

[1] "Kafka: a Distributed Messaging System for Log Processing", Jay Kreps, Neha Narkhede and Jun Rao.
http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
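The skew Pablo describes can be sketched in a few lines. This is an illustrative model, not Kafka's actual rebalance code: the function names and the simple round-robin order are assumptions, but it reproduces the 100-topics / 2-partitions / 10-consumers scenario from the thread, contrasting per-topic assignment with the cross-topic alternative proposed in KAFKA-687.

```python
def per_topic_assignment(topics, consumers):
    """Model of per-topic rebalancing: each topic's partitions are handed
    out to consumers in order, independently of every other topic."""
    assignment = {c: [] for c in consumers}
    for topic, n_partitions in sorted(topics.items()):
        for p in range(n_partitions):
            # With fewer partitions than consumers, only the first few
            # consumers in the sorted list ever receive a partition.
            owner = consumers[p % len(consumers)]
            assignment[owner].append((topic, p))
    return assignment

def cross_topic_assignment(topics, consumers):
    """Model of the proposed scheme: pool partitions from all topics,
    then distribute the pool round-robin across all consumers."""
    pool = [(t, p) for t, n in sorted(topics.items()) for p in range(n)]
    assignment = {c: [] for c in consumers}
    for i, tp in enumerate(pool):
        assignment[consumers[i % len(consumers)]].append(tp)
    return assignment

topics = {f"topic-{i:03d}": 2 for i in range(100)}   # 100 topics, 2 partitions each
consumers = [f"consumer-{i}" for i in range(10)]     # one consumer group of 10

per_topic = per_topic_assignment(topics, consumers)
cross = cross_topic_assignment(topics, consumers)

idle_per_topic = [c for c, parts in per_topic.items() if not parts]
idle_cross = [c for c, parts in cross.items() if not parts]
print(len(idle_per_topic))  # 8: only two consumers ever see data
print(len(idle_cross))      # 0: all 200 partitions spread over 10 consumers
```

Under the per-topic scheme, partition 0 and partition 1 of every topic always land on the same two consumers, leaving the other eight idle; pooling the 200 partitions first gives every consumer 20.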