Sergey Zyrianov created KAFKA-18168:
---------------------------------------
Summary: GlobalKTable does not checkpoint offsets until next 10K events
Key: KAFKA-18168
URL: https://issues.apache.org/jira/browse/KAFKA-18168
Project: Kafka
Issue Type: Improvement
Components: streams
Affects Versions: 3.8.1, 3.4.1
Reporter: Sergey Zyrianov

As in https://issues.apache.org/jira/browse/KAFKA-5241, a considerable amount of state is kept on the topic that backs a GlobalKTable. Restoring the GlobalKTable takes minutes before it is operational. After a successful restore, the checkpoint file is not created until a further 10K events arrive on the topic.

The following scenario illustrates the issue:
# {*}Scaling Out{*}: When a new instance (e.g., pod X) is added to an already running set of instances (pods 0...X-1), the new instance restores the state successfully. However, it does not create a checkpoint file until 10K events have been processed on the {{GlobalKTable}} topic.
# {*}Lack of Traffic{*}: If there is no new traffic on the {{GlobalKTable}} topic, there is no mechanism to force the creation of the checkpoint file, so the state remains uncheckpointed. Ref [https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StateManagerUtil.java#L78C35-L78C72]
# {*}Instance Restart{*}: If the new instance (pod X) is restarted (for example, due to an update) before 10K events have been processed, it has to restore the entire state from the topic again, repeating the same time-consuming restoration. This issue persists across restarts.

IMO, checkpointing during the restore process and upon completion/close is missing in the current implementation.
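
To make the scenario concrete, here is a minimal, self-contained sketch of an offset-delta checkpoint gate of the kind the referenced line appears to implement. It is a simplified model, not the Kafka source: the class name {{CheckpointGateSketch}}, the constant name, the strict {{>}} comparison, the topic-partition key {{global-topic-0}} and the offsets are all assumptions made for illustration; only the 10K figure comes from the report above.

```java
import java.util.Map;

// Simplified model of an offset-delta checkpoint gate (hypothetical names,
// not the actual Kafka Streams classes).
public class CheckpointGateSketch {

    // Assumed threshold, taken from the "10K events" figure in the report.
    static final long OFFSET_DELTA_THRESHOLD = 10_000L;

    // Returns true when a checkpoint file should be written.
    // lastSnapshot: offsets recorded at the previous checkpoint decision.
    // current:      offsets processed so far.
    static boolean checkpointNeeded(boolean enforce,
                                    Map<String, Long> lastSnapshot,
                                    Map<String, Long> current) {
        if (enforce) {
            return true; // e.g. an explicit checkpoint on restore completion or close
        }
        long totalDelta = 0L;
        for (Map.Entry<String, Long> e : current.entrySet()) {
            totalDelta += e.getValue() - lastSnapshot.getOrDefault(e.getKey(), 0L);
        }
        return totalDelta > OFFSET_DELTA_THRESHOLD;
    }

    public static void main(String[] args) {
        // Offsets right after a full restore (illustrative value).
        Map<String, Long> restored = Map.of("global-topic-0", 5_000_000L);

        // Lack of traffic: nothing new arrives, so the gate never opens.
        System.out.println(checkpointNeeded(false, restored, restored));

        // Restart before 10K new events: still no checkpoint file.
        Map<String, Long> few = Map.of("global-topic-0", 5_009_999L);
        System.out.println(checkpointNeeded(false, restored, few));

        // Only after more than 10K new events does the gate open.
        Map<String, Long> many = Map.of("global-topic-0", 5_010_001L);
        System.out.println(checkpointNeeded(false, restored, many));

        // Enforcing a checkpoint once the restore completes sidesteps the gate.
        System.out.println(checkpointNeeded(true, restored, restored));
    }
}
```

Under these assumptions the program prints false, false, true, true. The last call models the behaviour requested here: forcing a checkpoint when the restore completes (or on close) would spare a restarted instance from replaying the whole topic.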