(From http://kafka.apache.org/design.html) One potential benefit of the existing rebalancing logic is that it reduces the number of broker connections per consumer instance. However, if you have a large number of partitions and few brokers and/or consumer instances, it doesn't really help, so I agree it would be good to implement KAFKA-687. KAFKA-564 <https://issues.apache.org/jira/browse/KAFKA-564> may also be related, i.e., it may be easier to implement along with or after KAFKA-687.
Joel

On Mon, Jan 7, 2013 at 10:44 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> Pablo,
>
> That is a good suggestion. Ideally, the partitions across all topics should
> be distributed evenly across consumer streams instead of a per-topic based
> decision. There is no particular advantage to the current scheme of
> per-topic rebalancing that I can think of. Would you mind filing a JIRA to
> track this improvement ?
>
> Thanks,
> Neha
>
> On Mon, Jan 7, 2013 at 9:10 AM, Jun Rao <jun...@gmail.com> wrote:
>
> > Pablo,
> >
> > Currently, partition is the smallest unit that we distribute data among
> > consumers (in the same consumer group). So, if the # of consumers is larger
> > than the total number of partitions in a Kafka cluster (across all
> > brokers), some consumers will never get any data. Such a decision is done
> > on a per topic basis. If a consumer consumes multiple topics, it would make
> > sense to divide partitions across all topics to consumers. We haven't done
> > that yet. Part of the reason is that we need to figure out how to balance
> > the data across topics since they can be of different sizes. We can look
> > into that post 0.8.
> >
> > For now, the solution is to increase the number of partitions on the
> > broker.
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Jan 7, 2013 at 9:03 AM, Pablo Barrera González <
> > pablo.barr...@gmail.com> wrote:
> >
> > > Hello
> > >
> > > We are starting to use Kafka in production but we found an unexpected (at
> > > least for me) behavior with the use of partitions. We have a bunch of
> > > topics with a few partitions each. We try to consume all data from several
> > > consumers (just one consumer group).
> > >
> > > The problem is in the rebalance step. The rebalance splits the partitions
> > > per topic between all consumers. So if you have 100 topics but only 2
> > > partitions each and 10 consumers only two consumers will be used. That is,
> > > for each topic all partitions will be listed and shared between the
> > > consumers in the consumer group in order (not randomly).
> > >
> > > This behavior is also described in algorithm 1 of the original kafka paper
> > > [1].
> > >
> > > I don't understand this decision. Why is split by topic? Does it make sense
> > > to divide all partitions from all topics between all the consumers in the
> > > consumer group? I don't see the reason of this so I would like to hear your
> > > opinion before changing the code.
> > >
> > > We are using kafka 0.7.1.
> > >
> > > Thank you in advance
> > >
> > > Pablo
> > >
> > > [1] "Kafka: a Distributed Messaging System for Log Processing", Jay Kreps,
> > > Neha Narkhede and Jun Rao.
> > >
> > > http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
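
To make the difference concrete, here is a minimal Python sketch of the per-topic split Pablo describes, next to one possible cross-topic alternative. It is an illustration only, not Kafka's consumer code: the function names (`assign_per_topic`, `assign_across_topics`), the topic and consumer ids, and the simple round-robin policy are all assumptions made for this example, and the cross-topic version ignores the topic-size balancing concern Jun raises.

```python
def assign_per_topic(topics, consumers):
    """Split each topic's partitions, in order, over the sorted consumers.

    topics: dict mapping topic name -> number of partitions (hypothetical input).
    Returns a dict mapping consumer id -> list of (topic, partition).
    """
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    for topic in sorted(topics):
        per_consumer, extra = divmod(topics[topic], len(consumers))
        for i, consumer in enumerate(consumers):
            # The first `extra` consumers take one additional partition.
            start = per_consumer * i + min(i, extra)
            count = per_consumer + (1 if i < extra else 0)
            assignment[consumer].extend(
                (topic, p) for p in range(start, start + count)
            )
    return assignment


def assign_across_topics(topics, consumers):
    """Assumed alternative: round-robin over all (topic, partition) pairs."""
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    all_partitions = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    for i, topic_partition in enumerate(all_partitions):
        assignment[consumers[i % len(consumers)]].append(topic_partition)
    return assignment


# Pablo's scenario: 100 topics, 2 partitions each, 10 consumers in one group.
topics = {f"topic-{i:03d}": 2 for i in range(100)}
consumers = [f"consumer-{i}" for i in range(10)]

per_topic = assign_per_topic(topics, consumers)
across = assign_across_topics(topics, consumers)

busy = [c for c, parts in per_topic.items() if parts]
print(len(busy))                                # consumers with work, per-topic
print({len(parts) for parts in across.values()})  # partitions per consumer, cross-topic
```

In this scenario the per-topic split leaves eight of the ten consumers idle, because every topic hands its two partitions to the same first two consumers, while the round-robin over all partitions gives each consumer 20.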