With Kafka Streams, it's common to spin up and shut down clusters of consumers by performing a graceful shutdown and restart during a deploy. One thing we've been running into is that during the startup and shutdown of a Kafka Streams cluster you can often get multiple rebalances as the group coordinator sees nodes come online or go offline.
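For reference, this rebalance churn is easy to observe with a state listener on the Streams instance; here is a minimal sketch, where the application id, broker address, and topic names are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class RebalanceLogging {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");          // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");                       // placeholder topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Log every state transition; during a rolling deploy you will typically see
        // several RUNNING -> REBALANCING -> RUNNING cycles as members join and leave.
        streams.setStateListener((newState, oldState) ->
                System.out.printf("Streams state changed: %s -> %s%n", oldState, newState));

        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```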
This is problematic because it means that, for a brief period, the consumer group and Kafka Streams begin bootstrapping state stores and/or starting processors with a misallocated set of tasks. For example, if we have 10 nodes and perform a (relatively) simultaneous graceful shutdown, what can happen is that 8 of the 10 nodes signal shutdown, and the remaining 2 have a rebalance triggered before they themselves have finished shutting down. This can result in a brief task re-allocation, state store re-initialization and materialization, and so on, but with all of the topic partitions and tasks across the job assigned to just two nodes, resulting in excessive load and delays in restarting.

One way to solve this problem that may be generally useful would be to add a consumer configuration that declares the minimum number of consumers that must be present in the group for the rebalancing process to complete (along with a timeout). This could help in a number of ways:

- Properly dealing with startup/shutdown rebalances
- Capping the number of topic partition assignments given to any single consumer
- Preventing resource saturation, both at the consumer level and cascading downstream
- Providing a meaningful event to alert on if there is a capacity deficit

I think for this to be done properly it is probably something the group coordinator would need to take care of, but one could also imagine a version where the consumers themselves defer consumption until the group constraints are satisfied (a rough sketch of that client-side variant is below). A v2 could also potentially introduce other invariants that must hold for a consumer to begin consumption.

Would love to know if there is already a better way to solve this problem, of course!
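To make the client-side variant concrete, here is a minimal sketch of how a consumer could defer consumption today: each member joins the group immediately but pauses its assigned partitions until an Admin client check shows the group has reached a minimum size, or a timeout expires. The group id, topic, broker address, member threshold, and timeout are placeholder values, and checking describeConsumerGroups on every loop iteration is only for brevity; this is an approximation of the proposed behavior using existing APIs, not an existing Kafka feature.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MinGroupSizeGate {
    public static void main(String[] args) throws Exception {
        final String groupId = "example-group";                     // placeholder group id
        final int minMembers = 8;                                   // hypothetical minimum group size
        final long deadline = System.currentTimeMillis() + 60_000;  // fall back after a timeout

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");           // placeholder broker
        props.put("group.id", groupId);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        AtomicBoolean gated = new AtomicBoolean(true);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Admin admin = Admin.create(props)) {

            // Join the group immediately, but pause any partitions we are assigned
            // while the minimum-members constraint is unsatisfied.
            consumer.subscribe(List.of("input-topic"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    if (gated.get()) {
                        consumer.pause(partitions);
                    }
                }
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
            });

            while (true) {
                // poll() keeps this member alive in the group; while gated, every
                // assigned partition is paused, so no records are returned.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));

                if (gated.get()) {
                    int members = admin.describeConsumerGroups(List.of(groupId))
                            .describedGroups().get(groupId).get()
                            .members().size();
                    if (members >= minMembers || System.currentTimeMillis() > deadline) {
                        consumer.resume(consumer.paused());  // constraint met (or timed out): start consuming
                        gated.set(false);
                    }
                }

                records.forEach(r -> System.out.println(r.key() + " -> " + r.value()));
            }
        }
    }
}
```

A coordinator-side implementation would of course avoid the extra Admin round trips and could enforce the constraint during assignment itself, which is why that still seems like the proper place for it.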