Apologies, just discovered KIP-134, which would address our specific
problem directly via delaying consumer join.

On Mon, Jul 10, 2017 at 4:53 PM, Greg Fodor <gfo...@gmail.com> wrote:

> With Kafka Streams, it's common to spin up and shut down clusters of
> consumers by performing a graceful shutdown and restart during a deploy.
> One thing we've been running into is that during the startup and shutdown
> of a kafka streams clusters you often can have multiple rebalances as the
> consumer coordinator sees nodes come on or go offline.
>
> This is problematic since this means that for a brief period, the consumer
> group and kafka streams begins bootstrapping state stores and/or starting
> processors with a mis-allocated set of tasks. For example, if we have 10
> nodes, and we perform a (relatively) simultaneous graceful shutdown, what
> can happen is that 8 of the 10 nodes signal shutdown and 2 of them have a
> rebalance triggered before they are finished shutting down. This can result
> in a brief task re-allocation, state store re-initialization and
> materialization, and so on, but with all of the partition topics and tasks
> across the job assigned to just two nodes, resulting in excessive load and
> delays in restarting.
>
> One of the ways to solve this problem that may be generally useful would
> be to add a consumer configuration that declares the minimum number of
> consumers in the group that must be available for the rebalancing process
> to complete (along with a timeout.) This could help in a number of ways:
>
> - Properly dealing with startup/shutdown rebalances
> - Capping the # of partition topic assignments to a specific consumer
> - Preventing resource saturation both at the consumer level or cascading
> downstream
> - Providing a meaningful event to alert on if there is a capacity deficit
>
> I think for this to be done properly it is probably something the
> coordinator would need to take care of, but one could also imagine a
> version where the consumers themselves defer consumption unless the group
> constraints are satisfied. A v2 could also potentially introduce other
> invariants that must be true for a consumer to begin consumption.
>
> Would love to know if there is already a better way to solve this problem,
> of course!
>

Reply via email to