Sergey Zyrianov created KAFKA-18168:
---------------------------------------

             Summary: GlobalKTable does not checkpoint offsets until next 10K 
events
                 Key: KAFKA-18168
                 URL: https://issues.apache.org/jira/browse/KAFKA-18168
             Project: Kafka
          Issue Type: Improvement
          Components: streams
    Affects Versions: 3.8.1, 3.4.1
            Reporter: Sergey Zyrianov


As in https://issues.apache.org/jira/browse/KAFKA-5241, a state of 
considerable size is kept on the topic that backs a GlobalKTable. Restoring 
the GlobalKTable takes minutes before it is operational. After a successful 
restore, the checkpoint file is not created until a further 10K events arrive 
on the topic. 



The following scenario illustrates the issue:
 # {*}Scaling Out{*}: When a new instance (e.g., pod X) is added to an already 
running set of instances (pods 0...X-1), the new instance will restore the 
state successfully. However, it will not create a checkpoint file until 10K 
events are processed on the {{GlobalKTable}} topic.

 # {*}Lack of Traffic{*}: If there is no new traffic on the {{GlobalKTable}} 
topic, there is no mechanism to force the creation of the checkpoint file. The 
state remains uncheckpointed. See 
[https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StateManagerUtil.java#L78C35-L78C72]

 # {*}Instance Restart{*}: If the new instance (pod X) is restarted (e.g., due 
to an update) before 10K events have been processed, it will have to restore 
the entire state from the topic again, leading to the same time-consuming 
restoration process. This issue persists across restarts.

IMO, the current implementation is missing checkpointing during the restore 
process and upon completion/close.
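
The behaviour described above can be sketched as a simplified, standalone 
model. This is not the actual Kafka Streams code: the class name 
CheckpointSketch, the method signature, the partition name and the force flag 
are all hypothetical, and the threshold value is assumed from the "10K events" 
in this report. It only shows why a threshold on the offset delta never fires 
on a quiet topic, and how a forced checkpoint at the end of restore would 
avoid that.

```java
import java.util.HashMap;
import java.util.Map;

public class CheckpointSketch {
    // Assumed threshold, taken from the "10K events" described in this report.
    static final long OFFSET_DELTA_THRESHOLD = 10_000L;

    /**
     * Hypothetical decision function: checkpoint when forced, or when the
     * offsets advanced by more than the threshold since the last snapshot.
     */
    static boolean checkpointNeeded(Map<String, Long> lastSnapshot,
                                    Map<String, Long> current,
                                    boolean force) {
        if (force) {
            return true;
        }
        long delta = 0L;
        for (Map.Entry<String, Long> e : current.entrySet()) {
            delta += e.getValue() - lastSnapshot.getOrDefault(e.getKey(), 0L);
        }
        return delta > OFFSET_DELTA_THRESHOLD;
    }

    public static void main(String[] args) {
        // Offsets right after a full restore; the snapshot is taken here,
        // so the restored records themselves do not count towards the delta.
        Map<String, Long> afterRestore = new HashMap<>();
        afterRestore.put("global-topic-0", 5_000_000L);

        // No new traffic: delta is 0, so no checkpoint is written.
        System.out.println(checkpointNeeded(afterRestore, afterRestore, false));

        // 10,001 new records: delta exceeds the threshold.
        Map<String, Long> later = new HashMap<>();
        later.put("global-topic-0", 5_010_001L);
        System.out.println(checkpointNeeded(afterRestore, later, false));

        // Proposed behaviour: force a checkpoint once restore completes.
        System.out.println(checkpointNeeded(afterRestore, afterRestore, true));
    }
}
```

Running main prints false, true, true. In this model, a restart in the first 
case would find no checkpoint file and would have to restore from the 
beginning of the topic again.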

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
