[ https://issues.apache.org/jira/browse/KAFKA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410010#comment-17410010 ]
Sagar Rao commented on KAFKA-12550:
-----------------------------------

[~ableegoldman]/[~mjsax], thanks, it makes sense now. I liked the idea of removing the PARTITIONS_ASSIGNED/PARTITIONS_REVOKED states from StreamThread and adding REBALANCING/RESTORING, given cooperative rebalancing. I also agree that REBALANCING should take precedence over RESTORING.

> Introduce RESTORING state to the KafkaStreams FSM
> -------------------------------------------------
>
>                 Key: KAFKA-12550
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12550
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: Sagar Rao
>            Priority: Major
>              Labels: needs-kip
>             Fix For: 4.0.0
>
>
> We should consider adding a new state to the KafkaStreams FSM: RESTORING
> This would cover the time between the completion of a stable rebalance and
> the completion of restoration across the client. Currently, Streams reports
> the state during this time as REBALANCING, even though in most cases it
> spends much more time restoring than rebalancing.
> There are a few motivations/benefits behind this idea:
> # Observability is a big one: using the umbrella REBALANCING state to cover
> all aspects of rebalancing -> task initialization -> restoring has been a
> common source of confusion in the past. It has also proved to be a time sink
> for us during escalations, incidents, mailing list questions, and bug
> reports. It often adds latency to escalations in particular, as we have to go
> through GTS and wait for the customer to clarify whether their "Kafka Streams
> is stuck rebalancing" ticket means that it's literally rebalancing, or just
> in the REBALANCING state and actually stuck elsewhere in Streams.
> # Prereq for global thread improvements: for example,
> [KIP-406: GlobalStreamThread should honor custom reset policy|https://cwiki.apache.org/confluence/display/KAFKA/KIP-406%3A+GlobalStreamThread+should+honor+custom+reset+policy]
> was ultimately blocked on this, as we needed to pause the Streams app while
> the global thread restored from the appropriate offset. Since there's
> absolutely no rebalancing involved in this case, piggybacking on the
> REBALANCING state would just be shooting ourselves in the foot.
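
On the precedence point (REBALANCING over RESTORING), here is a rough sketch of how a client-level state might be derived from per-thread phases. Everything in it is an assumption made for illustration: `ClientStateSketch`, `Phase`, and `clientState` are hypothetical names, not the actual KafkaStreams or StreamThread code, and the real FSM has more states and transitions than the three shown.

```java
import java.util.List;

public class ClientStateSketch {
    // Hypothetical, reduced set of phases; the real KafkaStreams FSM has more states.
    enum Phase { REBALANCING, RESTORING, RUNNING }

    // Derive one client-level state from the phases reported by each thread.
    // REBALANCING wins over RESTORING, and RESTORING wins over RUNNING.
    static Phase clientState(List<Phase> threadPhases) {
        boolean anyRestoring = false;
        for (Phase p : threadPhases) {
            if (p == Phase.REBALANCING) {
                return Phase.REBALANCING; // any rebalancing thread decides the result
            }
            if (p == Phase.RESTORING) {
                anyRestoring = true;
            }
        }
        // An empty list falls through to RUNNING; a real implementation would
        // need to decide what "no threads" should mean.
        return anyRestoring ? Phase.RESTORING : Phase.RUNNING;
    }

    public static void main(String[] args) {
        System.out.println(clientState(List.of(Phase.RESTORING, Phase.REBALANCING)));
        System.out.println(clientState(List.of(Phase.RESTORING, Phase.RUNNING)));
        System.out.println(clientState(List.of(Phase.RUNNING, Phase.RUNNING)));
    }
}
```

With this rule, a client only reports RESTORING once no thread is still rebalancing, which matches the description's intent of covering the time between a stable rebalance and the end of restoration.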