Apologies, I just discovered KIP-134, which would address our specific problem directly by delaying the initial rebalance while consumers join.

On Mon, Jul 10, 2017 at 4:53 PM, Greg Fodor <gfo...@gmail.com> wrote:

> With Kafka Streams, it's common to spin up and shut down clusters of
> consumers by performing a graceful shutdown and restart during a deploy.
> One thing we've been running into is that during the startup and shutdown
> of a Kafka Streams cluster you can often have multiple rebalances as the
> consumer coordinator sees nodes come online or go offline.
>
> This is problematic since it means that, for a brief period, the consumer
> group and Kafka Streams begin bootstrapping state stores and/or starting
> processors with a mis-allocated set of tasks. For example, if we have 10
> nodes and we perform a (relatively) simultaneous graceful shutdown, what
> can happen is that 8 of the 10 nodes signal shutdown and 2 of them have a
> rebalance triggered before they have finished shutting down. This can
> result in a brief task re-allocation, state store re-initialization and
> materialization, and so on, but with all of the topic partitions and tasks
> across the job assigned to just two nodes, resulting in excessive load and
> delays in restarting.
>
> One way to solve this problem that may be generally useful would be to
> add a consumer configuration that declares the minimum number of
> consumers in the group that must be available for the rebalancing process
> to complete (along with a timeout). This could help in a number of ways:
>
> - Properly dealing with startup/shutdown rebalances
> - Capping the number of topic partition assignments to a specific consumer
> - Preventing resource saturation both at the consumer level and cascading
>   downstream
> - Providing a meaningful event to alert on if there is a capacity deficit
>
> I think for this to be done properly it is probably something the
> coordinator would need to take care of, but one could also imagine a
> version where the consumers themselves defer consumption unless the group
> constraints are satisfied. A v2 could also potentially introduce other
> invariants that must be true for a consumer to begin consumption.
>
> Would love to know if there is already a better way to solve this problem,
> of course!
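
To make the consumer-side variant of the proposal a bit more concrete (consumers deferring consumption until a minimum group size is reached, with a timeout), here is a rough sketch of just the decision logic. Everything in it is hypothetical: `MinGroupSizeGate`, `minMembers`, `timeoutMs`, and the `Decision` values are names made up for illustration and are not Kafka APIs or configs. The current member count is passed in by the caller rather than read from a real coordinator, so this says nothing about how a consumer would actually learn the group size.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch, not part of the Kafka client API: decides whether a
 * consumer should start processing, keep waiting for more group members,
 * or give up waiting once a timeout has elapsed.
 */
final class MinGroupSizeGate {

    enum Decision { WAIT, PROCEED, PROCEED_AFTER_TIMEOUT }

    private final int minMembers;     // assumed "minimum consumers" setting
    private final long timeoutNanos;  // assumed timeout, stored in nanoseconds
    private final long startNanos;    // when this consumer began waiting

    MinGroupSizeGate(int minMembers, long timeoutMs, long startNanos) {
        if (minMembers < 1) {
            throw new IllegalArgumentException("minMembers must be >= 1");
        }
        if (timeoutMs < 0) {
            throw new IllegalArgumentException("timeoutMs must be >= 0");
        }
        this.minMembers = minMembers;
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        this.startNanos = startNanos;
    }

    /**
     * @param currentMembers group size as observed by the caller
     * @param nowNanos       current time from the same clock as startNanos
     */
    Decision evaluate(int currentMembers, long nowNanos) {
        if (currentMembers >= minMembers) {
            return Decision.PROCEED;
        }
        // Too few members: proceed anyway only if the timeout has elapsed,
        // which is also the natural point to raise a capacity alert.
        if (nowNanos - startNanos >= timeoutNanos) {
            return Decision.PROCEED_AFTER_TIMEOUT;
        }
        return Decision.WAIT;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        MinGroupSizeGate gate = new MinGroupSizeGate(10, 30_000, start);
        long later = start + TimeUnit.SECONDS.toNanos(31);

        System.out.println(gate.evaluate(2, start));   // WAIT
        System.out.println(gate.evaluate(10, start));  // PROCEED
        System.out.println(gate.evaluate(2, later));   // PROCEED_AFTER_TIMEOUT
    }
}
```

In the 10-node example from the quoted message, the two nodes that see the early rebalance would get `WAIT` and hold off on bootstrapping state stores until either the rest of the group shows up or the timeout fires. `PROCEED_AFTER_TIMEOUT` is kept distinct from `PROCEED` so the capacity-deficit case can be alerted on separately.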