Jun, Joel,

The issue here is exactly which threads are left out and which threads are assigned partitions. Maybe I am missing something, but what I want is to balance consuming threads across machines/processes, regardless of the number of threads each machine launches (side effect: this way, if you have more threads than partitions, you get a reserve force waiting to step in).
Example: launching 4 processes on 4 different machines, with 4 threads per process, on a 12-partition topic leaves each machine with 3 assigned threads and one doing nothing. Moreover, no matter how many threads each process has, as long as it is bigger than 3 the end result stays the same: 3 assigned threads per machine and the rest of them doing nothing. Ideally, I would want something like a consumer set/ensemble/{whatever word, not group} used to denote a group of threads on a machine, so that when specific threads request to join a consumer group they are elected in a way that balances them across the machines denoted by the consumer set/ensemble identifier.

Will partition.assignment.strategy="roundrobin" help with that?

10x,
Shlomi

On Thu, Oct 30, 2014 at 4:00 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> Shlomi,
>
> If you are on trunk, and your consumer subscriptions are identical,
> then you can try a slightly different partition assignment strategy.
> Try setting partition.assignment.strategy="roundrobin" in your
> consumer config.
>
> Thanks,
>
> Joel
>
> On Wed, Oct 29, 2014 at 06:29:30PM -0700, Jun Rao wrote:
> > By consumer, I actually mean consumer threads (the thread # you used when
> > creating consumer streams). So, if you have 4 consumers, each with 4
> > threads, 4 of the threads will not get any data with 12 partitions. It
> > sounds like that's not what you get? What's the output of the
> > ConsumerOffsetChecker (see http://kafka.apache.org/documentation.html)?
> >
> > For consumer.id, you don't need to set it in general. We generate some uuid
> > automatically.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Oct 28, 2014 at 4:59 AM, Shlomi Hazan <shl...@viber.com> wrote:
> >
> > > Jun,
> > >
> > > I hear you say "partitions are evenly distributed among all consumers in
> > > the same group", yet I did bump into a case where launching a process with
> > > X high-level consumer API threads took over all partitions, sending the
> > > existing consumers into unemployment.
> > >
> > > According to the claim above, and if I am not mistaken:
> > > on a topic T with 12 partitions and 3 consumers C1-C3 in the same group,
> > > with 4 threads each,
> > > adding a new consumer C4 with 12 threads should yield the following
> > > balance:
> > > C1-C3 each relinquish a single partition, holding only 3 partitions each;
> > > C4 holds the 3 partitions relinquished by C1-C3.
> > > Yet, in the case I described, what happened is that C4 gained all 12
> > > partitions and sent C1-C3 out of business with 0 partitions each.
> > > Now maybe I overlooked something, but I think I did see that happen.
> > >
> > > BTW:
> > > What key is used to distinguish one consumer from another? "consumer.id"?
> > > The docs for "consumer.id" say "Generated automatically if not set."
> > > What is the best practice for setting its value? Leave it empty? Is the
> > > server host name good enough? What are the considerations?
> > > When using the high-level consumer API, are all threads identified as the
> > > same consumer? I guess they are, right?...
> > >
> > > Thanks,
> > > Shlomi
> > >
> > >
> > > On Tue, Oct 28, 2014 at 4:21 AM, Jun Rao <jun...@gmail.com> wrote:
> > >
> > > > You can take a look at the "consumer rebalancing algorithm" part in
> > > > http://kafka.apache.org/documentation.html. Basically, partitions are
> > > > evenly distributed among all consumers in the same group. If there are
> > > > more consumers in a group than partitions, some consumers will never
> > > > get any data.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Oct 27, 2014 at 4:14 AM, Shlomi Hazan <shl...@viber.com> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Using Kafka's high-level consumer API, I have bumped into a situation
> > > > > where launching a consumer process P1 with X consuming threads on a
> > > > > topic with X partitions kicks out all other existing consumer threads
> > > > > that consumed prior to launching the process P1.
> > > > > That is, consumer process P1 is stealing all partitions from all other
> > > > > consumer processes.
> > > > >
> > > > > While understandable, this makes it hard to size & deploy a cluster
> > > > > with a number of partitions that will, on the one hand, allow balancing
> > > > > of consumption across consuming processes (dividing the partitions
> > > > > across consumers by setting each consumer with its share of the total
> > > > > number of partitions on the consumed topic), and on the other hand
> > > > > provide room for growth and the addition of new consumers to help with
> > > > > increasing traffic into the cluster and the topic.
> > > > >
> > > > > This stealing effect forces me either to have more partitions than
> > > > > really needed at the moment, planning for future growth, or to stick to
> > > > > what I need and trust the option to add partitions, which comes with a
> > > > > price in terms of restarting consumers, bumping into out-of-order
> > > > > messages (hash partitioning), etc.
> > > > >
> > > > > Is this policy of stealing intended, or did I just jump to conclusions?
> > > > > What is the way to cope with the sizing question?
> > > > >
> > > > > Shlomi
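[For anyone reading this thread in the archive: the "some threads get nothing" behavior Jun describes can be reproduced with a small simulation. The sketch below is a simplified model of the range-style assignment described in the "consumer rebalancing algorithm" section of the Kafka docs (sort the consumer thread ids, give each a contiguous slice of the sorted partition list); the `range_assign` helper and the `c<machine>-<thread>` naming are illustrative assumptions, not Kafka's actual consumer-id format.]

```python
def range_assign(num_partitions, threads):
    """Simplified model of range partition assignment:
    sort the thread ids, then hand each thread a contiguous
    slice of the partition list, front-loading any remainder."""
    threads = sorted(threads)
    per_thread, remainder = divmod(num_partitions, len(threads))
    assignment = {}
    start = 0
    for i, t in enumerate(threads):
        count = per_thread + (1 if i < remainder else 0)
        assignment[t] = list(range(start, start + count))
        start += count
    return assignment

# The scenario from the thread: 4 machines x 4 threads, 12-partition topic.
threads = ["c%d-%d" % (machine, t) for machine in range(1, 5) for t in range(4)]
assignment = range_assign(12, threads)

idle = [t for t, parts in assignment.items() if not parts]
print(idle)  # under this model, all four idle threads land on one consumer
```

Under this simplified model, the four unassigned threads all belong to the consumer that sorts last, rather than being spread one-per-machine. Joel's suggestion is a one-line consumer config change (on trunk at the time), `partition.assignment.strategy=roundrobin`, which per his note requires identical subscriptions across consumers; whether it yields the per-machine balance Shlomi asks about is exactly the open question in the thread.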