[ https://issues.apache.org/jira/browse/KAFKA-18168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922137#comment-17922137 ]
Janindu Pathirana commented on KAFKA-18168: ------------------------------------------- Hi [~mjsax], Explanation is clear. Calling the `GlobalStateUpdateTask.flushState()` method during `GlobalStateUpdateTask.close()` will checkpoint when the instance when closing. What I really wanted to clarify was, the checkpointing logic when the application is running. Like if we change the condition in `GlobalStateUpdateTask.maybeCheckpoint()` to an OR condition(ie, currently it has an AND condition which checks for both the 10k events and a flush interval which is the COMMIT_INTERVAL_MS_CONFIG), we can achieve the periodic checkpointing. But then the question that I have is, if checkpointing to a time interval without any activity is efficient or not? So yeah basically, checkpointing when restoring and closing is a must, and what I additionally wanted to clarify was whether I should implement periodic checkpointing as well or if checkpointing only during restoring and closing is enough. Would be great if you can clarify this for me. Thank you! > GlobalKTable does not checkpoint restored offsets until next 10K events > ----------------------------------------------------------------------- > > Key: KAFKA-18168 > URL: https://issues.apache.org/jira/browse/KAFKA-18168 > Project: Kafka > Issue Type: Improvement > Components: streams > Affects Versions: 3.4.1, 3.8.1 > Reporter: Sergey Zyrianov > Assignee: Janindu Pathirana > Priority: Minor > > As in https://issues.apache.org/jira/browse/KAFKA-5241, there is a state of > considerable size kept on a topic that backs up GlobalKTalbe. Restoring > GlobalKTable takes minutes before it is operational. After successful restore > the checkpoint file is not created until further 10K events happen on the > topic. > The following scenario illustrates the issue: > # {*}Scaling Out{*}: When a new instance (e.g., pod X) is added to an > already running set of instances (pods 0...X-1), the new instance will > restore the state successfully. However, it will not create a checkpoint file > until 10K events are processed on the {{GlobalKTable}} topic. > # {*}Lack of Traffic{*}: If there is no new traffic on the {{GlobalKTable}} > topic, there is no mechanism to force the creation of the checkpoint file. > The state remains uncheckpointed. Ref > [https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StateManagerUtil.java#L78C35-L78C72] > # {*}Instance Restart{*}: If the new instance (pod X) is restarted (due to > update for ex) before 10K events have been processed, it will have to restore > the entire state from the topic again, leading to the same time-consuming > restoration process. This issue persists across restarts. > IMO, checkpointing during the restore process and upon completion/close is > missing in the current implementation > -- This message was sent by Atlassian Jira (v8.20.10#820010)