Gerard, we have a similar use case that we use Kafka for, and we set max.poll.interval.ms to a large value in order to handle the worst-case scenario.
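Concretely, that kind of set-up looks roughly like the fragment below (the values, and the two extra settings, are illustrative rather than exactly what we run -- size everything off your own worst case):

    // Fragment of consumer set-up -- illustrative values only.
    java.util.Properties props = new java.util.Properties();
    // Allow e.g. ~30 minutes between poll() calls to cover a long job that can overrun.
    props.put("max.poll.interval.ms", String.valueOf(30 * 60 * 1000));
    // Fetch one record per poll() so a single slow message can't stall a whole fetched batch.
    props.put("max.poll.records", "1");
    // Commit manually, only once the work for a record has actually finished.
    props.put("enable.auto.commit", "false");
    // ...plus the usual bootstrap.servers, group.id and deserializer settings.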
Rebalancing is indeed a big problem with this approach (and not just for "new" consumers as you mentioned -- adding consumers causes a stop-the-world rebalance on all existing consumers as well).

The static consumer group membership introduced in Kafka 2.3 helps quite a lot with the failed-and-restarting consumer case (though I ran into several bugs with it and still have a stream that won't start in dev; it seems to be working in prod now with patched Kafka brokers), but it does not handle scaling your consumers up (or down). Scaling consumers is still a stop-the-world rebalance, and you can expect it to take at least as long as the slowest in-flight message takes to process (if all goes well), because your consumers will essentially sit idle waiting for every other consumer to finish. Theoretically, the incremental rebalancing capabilities coming out soon (KAFKA-8179) should help resolve this.

The other issue you will run into with long processing is "hot" partitions/consumers: with a large backlog of such messages it is inevitable that some partitions will take much longer to work through than others, which means *some* of your consumers will sit idle while others lag *way* behind. I have found no easy solution to this, because one partition is always processed by at most one Kafka consumer. You have to work around it by reading a batch of messages, distributing the work yourself, tracking completions outside of Kafka, and then committing offsets back to Kafka manually (roughly the shape of the sketch I have pasted below the quoted thread).

Honestly, because of these issues, I am now seriously considering migrating our Kafka workloads to Apache Pulsar, which supports the same semantics as Kafka but handles work-queue / message-queue style messaging *much* better: it uses per-message acknowledgement, and subscriptions are not limited to the number of partitions, i.e. you can have N partitions and N+M consumers.

Regards,
Raman

On Thu, Jun 13, 2019 at 1:54 PM Mark Anderson <manderso...@gmail.com> wrote:

>
> We have a different use case, where we stop consuming because the connection
> to an external system is down.
>
> In this case we sleep for the same period as our poll timeout would be and
> recommit the previous offset. This stops the consumer going stale and
> avoids increasing the max interval.
>
> Perhaps you could do something similar?
>
> Mark
>
>
> On Thu, 13 Jun 2019, 17:49 Murphy, Gerard, <gerard.mur...@sap.com> wrote:
>
> > Hi,
> >
> > I am wondering if there is something I am missing about my set-up to
> > facilitate long-running jobs.
> >
> > For my purposes it is ok to have `At most once` message delivery, which
> > means it is not required to think about committing offsets (or at least it
> > is ok to commit each message's offset upon receiving it).
> >
> > I have the following in order to achieve the competing consumer pattern:
> >
> >   * A topic
> >   * X consumers in the same group
> >   * P partitions in the topic (where P >= X always)
> >
> > My problem is that I have messages that can take ~15 minutes (though this
> > may fluctuate by up to 50%, let's say) to process. In order to avoid
> > consumers having their partition assignments revoked, I have increased the
> > value of `max.poll.interval.ms` to reflect this.
> > However this comes with some negative consequences:
> >
> >   * if some message exceeds this length of time then, in the worst-case
> >     scenario, the consumer processing this message will have to wait up to
> >     the value of `max.poll.interval.ms` for a rebalance
> >   * if I need to scale and increase the number of consumers based on load,
> >     then any new consumers might also have to wait up to
> >     `max.poll.interval.ms` for a rebalance to occur before they can process
> >     any messages
> >
> > As it stands at the moment I see that I can proceed as follows:
> >
> >   * Set `max.poll.interval.ms` to a small value and accept that every
> >     consumer processing every message will time out and go through the
> >     process of having assignments revoked and waiting a small amount of
> >     time for a rebalance
> >
> > However I do not like this, and am considering looking at alternative
> > technology for my message queue as I do not see any obvious way around this.
> > Admittedly I am new to Kafka, and it is just a gut feeling that the above
> > is not desirable.
> >
> > I have used RabbitMQ in the past for these scenarios, however we need
> > Kafka in our architecture for other purposes at the moment and it would be
> > nice not to have to introduce another technology if Kafka can achieve this.
> >
> > I appreciate any advice that anybody can offer on this subject.
> >
> > Regards,
> > Ger
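P.S. For what it's worth, the "read a batch, distribute the work yourself, track completions, commit manually" workaround I mentioned above has roughly the shape below. Treat it as a sketch only: the topic name, group ids, pool size and timeouts are made up, and rebalance handling, error handling and shutdown are all glossed over.

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class LongRunningConsumerSketch {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");         // illustrative
            props.put("group.id", "long-running-jobs");            // illustrative
            props.put("group.instance.id", "worker-1");            // static membership (Kafka 2.3+), unique per instance
            props.put("enable.auto.commit", "false");
            ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is a guess

            try (KafkaConsumer<String, String> consumer =
                         new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer())) {
                consumer.subscribe(List.of("jobs"));                // topic name is made up

                while (true) {
                    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                    if (batch.isEmpty()) {
                        continue;
                    }

                    // Stop fetching more records while this batch is in flight, but keep
                    // calling poll() below so the consumer stays live in the group.
                    consumer.pause(consumer.assignment());

                    List<Future<?>> inFlight = new ArrayList<>();
                    Map<TopicPartition, OffsetAndMetadata> completed = new HashMap<>();
                    for (ConsumerRecord<String, String> record : batch) {
                        inFlight.add(pool.submit(() -> process(record)));  // process() = your long-running job
                        completed.put(new TopicPartition(record.topic(), record.partition()),
                                      new OffsetAndMetadata(record.offset() + 1));
                    }

                    // Wait for the whole batch, polling the (paused) consumer in between so
                    // we never exceed max.poll.interval.ms however long the jobs run.
                    while (!inFlight.stream().allMatch(Future::isDone)) {
                        consumer.poll(Duration.ofSeconds(1));       // returns nothing while paused
                    }

                    consumer.commitSync(completed);                 // commit only after the work is done
                    consumer.resume(consumer.paused());
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            // placeholder for the actual long-running work
        }
    }

A real version would also want a ConsumerRebalanceListener so that a rebalance in the middle of a batch is handled sensibly (e.g. dropping or re-queuing work for revoked partitions), but the skeleton shows the moving parts: pause() the assignment, keep poll()ing so the consumer stays in the group, and commitSync() the offsets only once the work has finished.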