[jira] [Commented] (KAFKA-18168) GlobalKTable does not checkpoint restored offsets until next 10K events

Janindu Pathirana (Jira) Wed, 29 Jan 2025 08:16:55 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-18168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922137#comment-17922137
 ]


Janindu Pathirana commented on KAFKA-18168:
-------------------------------------------

Hi [~mjsax],

The explanation is clear. Calling `GlobalStateUpdateTask.flushState()` 
during `GlobalStateUpdateTask.close()` will write a checkpoint when the 
instance is closing.

What I really wanted to clarify is the checkpointing logic while the 
application is running. If we change the condition in 
`GlobalStateUpdateTask.maybeCheckpoint()` from AND to OR (currently it 
requires both the 10K events and an elapsed flush interval, which is 
COMMIT_INTERVAL_MS_CONFIG), we get periodic checkpointing. But then my 
question is: is checkpointing on a time interval, without any activity on 
the topic, efficient? 
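
To make the difference between the two conditions concrete, here is a minimal 
standalone sketch. It is not the actual Kafka code: the class, method, and 
constant names are hypothetical, and the 30-second interval is only an example 
value. It just compares how an AND check and an OR check behave on an idle 
topic right after a restore.

```java
// Hypothetical sketch: names and structure are illustrative only and are
// NOT taken from GlobalStateUpdateTask or StateManagerUtil.
final class CheckpointPolicySketch {

    // Assumed to correspond to the "10K events" threshold discussed above.
    static final long OFFSET_DELTA_THRESHOLD = 10_000L;

    // Behaviour as described for the current code: the flush interval must
    // have elapsed AND more than 10K offsets must have been processed.
    static boolean shouldCheckpointAnd(long elapsedMs, long flushIntervalMs, long offsetDelta) {
        return elapsedMs >= flushIntervalMs && offsetDelta > OFFSET_DELTA_THRESHOLD;
    }

    // Proposed alternative: either condition alone triggers a checkpoint.
    static boolean shouldCheckpointOr(long elapsedMs, long flushIntervalMs, long offsetDelta) {
        return elapsedMs >= flushIntervalMs || offsetDelta > OFFSET_DELTA_THRESHOLD;
    }

    public static void main(String[] args) {
        long flushIntervalMs = 30_000L; // example value standing in for the commit interval

        // Idle topic after restore: the interval has elapsed, no new events.
        System.out.println(shouldCheckpointAnd(30_000L, flushIntervalMs, 0L)); // false
        System.out.println(shouldCheckpointOr(30_000L, flushIntervalMs, 0L));  // true
    }
}
```

With the AND check, an idle topic never triggers a checkpoint, which matches 
the reported behaviour. With the OR check, a checkpoint is triggered on every 
elapsed interval even when no offsets have changed, which is the efficiency 
concern raised above.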

So basically, checkpointing during restore and on close is a must. What I 
additionally wanted to clarify is whether I should implement periodic 
checkpointing as well, or whether checkpointing only during restore and on 
close is enough. It would be great if you could clarify this for me.

Thank you! 

> GlobalKTable does not checkpoint restored offsets until next 10K events
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-18168
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18168
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>    Affects Versions: 3.4.1, 3.8.1
>            Reporter: Sergey Zyrianov
>            Assignee: Janindu Pathirana
>            Priority: Minor
>
> As in https://issues.apache.org/jira/browse/KAFKA-5241, there is a state of 
> considerable size kept on a topic that backs a GlobalKTable. Restoring 
> the GlobalKTable takes minutes before it is operational. After a successful 
> restore, the checkpoint file is not created until a further 10K events 
> happen on the topic. 
> The following scenario illustrates the issue:
>  # {*}Scaling Out{*}: When a new instance (e.g., pod X) is added to an 
> already running set of instances (pods 0...X-1), the new instance will 
> restore the state successfully. However, it will not create a checkpoint file 
> until 10K events are processed on the {{GlobalKTable}} topic.
>  # {*}Lack of Traffic{*}: If there is no new traffic on the {{GlobalKTable}} 
> topic, there is no mechanism to force the creation of the checkpoint file. 
> The state remains uncheckpointed. Ref 
> [https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StateManagerUtil.java#L78C35-L78C72]
>  # {*}Instance Restart{*}: If the new instance (pod X) is restarted (for 
> example, due to an update) before 10K events have been processed, it will 
> have to restore 
> the entire state from the topic again, leading to the same time-consuming 
> restoration process. This issue persists across restarts.
> IMO, checkpointing during the restore process and upon completion/close is 
> missing in the current implementation.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
