A. Sophie Blee-Goldman created KAFKA-12550:
----------------------------------------------

             Summary: Introduce RESTORING state to the KafkaStreams FSM
                 Key: KAFKA-12550
                 URL: https://issues.apache.org/jira/browse/KAFKA-12550
             Project: Kafka
          Issue Type: Improvement
          Components: streams
            Reporter: A. Sophie Blee-Goldman
            Assignee: A. Sophie Blee-Goldman
             Fix For: 3.0.0


We should consider adding a new state to the KafkaStreams FSM: RESTORING.

This would cover the time between the completion of a stable rebalance and the 
completion of restoration across the client. Currently, Streams reports its 
state during this time as REBALANCING, even though in most cases it spends much 
more time restoring than rebalancing.
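
To make the distinction concrete, here is a minimal Java sketch. It is an 
illustration only, not the actual KafkaStreams code: the class name, the 
simplified three-value enum, and both methods are hypothetical, and the real 
FSM has more states and transitions than shown here.

```java
public class RestoringStateSketch {

    // Hypothetical, simplified subset of client states.
    // RESTORING is the proposed addition; the real FSM has more states.
    enum State { REBALANCING, RESTORING, RUNNING }

    // Behavior as described in this ticket: everything up to the end of
    // restoration is reported under the umbrella REBALANCING state.
    static State reportedToday(boolean rebalanceInProgress,
                               boolean restorationInProgress) {
        return (rebalanceInProgress || restorationInProgress)
                ? State.REBALANCING
                : State.RUNNING;
    }

    // Assumed behavior with the proposal: once the rebalance is stable,
    // ongoing restoration is reported as RESTORING instead.
    static State reportedWithRestoring(boolean rebalanceInProgress,
                                       boolean restorationInProgress) {
        if (rebalanceInProgress) {
            return State.REBALANCING;
        }
        if (restorationInProgress) {
            return State.RESTORING;
        }
        return State.RUNNING;
    }

    public static void main(String[] args) {
        // Rebalance has completed, but state stores are still restoring.
        System.out.println(reportedToday(false, true));          // REBALANCING
        System.out.println(reportedWithRestoring(false, true));  // RESTORING
    }
}
```

With a split like this, a client that reports REBALANCING is actually 
rebalancing, and one that is only restoring says so.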

There are a few motivations/benefits behind this idea:

# Observability is a big one: using the umbrella REBALANCING state to cover all 
aspects of rebalancing -> task initialization -> restoring has been a common 
source of confusion in the past. It has also proved to be a time sink for us 
during escalations, incidents, mailing list questions, and bug reports. It 
often adds latency to escalations in particular, as we have to go through GTS 
and wait for the customer to clarify whether their “Kafka Streams is stuck 
rebalancing” ticket means that it is literally rebalancing, or just in the 
REBALANCING state and actually stuck elsewhere in Streams.
# Prereq for global thread improvements: for example, 
[KIP-406: GlobalStreamThread should honor custom reset policy|https://cwiki.apache.org/confluence/display/KAFKA/KIP-406%3A+GlobalStreamThread+should+honor+custom+reset+policy] 
was ultimately blocked on this, as we needed to pause the Streams app while the 
global thread restored from the appropriate offset. Since there is no 
rebalancing involved in this case, piggybacking on the REBALANCING state would 
just be shooting ourselves in the foot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
