[ https://issues.apache.org/jira/browse/FLINK-37375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zakelly Lan updated FLINK-37375:
--------------------------------
    Issue Type: New Feature  (was: Bug)

> Checkpoint supports the Operator to customize asynchronous snapshot state
> -------------------------------------------------------------------------
>
>                 Key: FLINK-37375
>                 URL: https://issues.apache.org/jira/browse/FLINK-37375
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.20.1
>            Reporter: Jufang He
>            Priority: Major
>              Labels: pull-request-available
>
> In some Flink task operators, slow operations such as file uploads or data
> flushing may be performed during the synchronous phase of a checkpoint. If
> the external storage component performs poorly, the synchronous phase may
> take too long to execute, significantly reducing the job's throughput.
> For example, during our internal use of Paimon, we observed that uploading
> files to HDFS during the checkpoint's synchronous phase could hit random
> HDFS slow-node issues, with a substantial negative impact on task
> throughput.
> To address this issue, I propose supporting a generic, operator-defined
> asynchronous snapshot feature that lets users move time-consuming logic to
> the asynchronous phase of a checkpoint, thereby minimizing the blocking of
> the main thread and improving task throughput. For instance, the Paimon
> writer operator could write data locally during the synchronous phase and
> upload files to remote storage during the asynchronous phase. Beyond the
> Paimon data upload scenario, other operator logic may also run slowly
> during the synchronous phase. This approach aims to address such issues
> uniformly.
> I drafted a FLIP for this issue:
> [https://docs.google.com/document/d/1lwxLEQjD6jVhZUBMRGhzQNWKSvdbPbYNQsV265gR4kw]
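
The split described in the issue (a cheap local write in the synchronous phase, the slow upload in the asynchronous phase) can be illustrated with a small plain-Java sketch. This is not the API proposed in the FLIP and it uses no Flink or Paimon classes: AsyncSnapshotSketch, processElement, snapshotSync, and the two directories are hypothetical names chosen for illustration, and a local temporary directory stands in for remote storage such as HDFS.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration only; none of these names are Flink or Paimon APIs. */
public class AsyncSnapshotSketch {

    private final List<String> buffer = new ArrayList<>();
    private final Path localDir;
    private final Path remoteDir; // stands in for remote storage such as HDFS

    public AsyncSnapshotSketch(Path localDir, Path remoteDir) {
        this.localDir = localDir;
        this.remoteDir = remoteDir;
    }

    /** Called on the main thread for each record. */
    public void processElement(String record) {
        buffer.add(record);
    }

    /**
     * Synchronous phase: runs on the main thread, so it only performs the
     * cheap local write and hands back the slow upload as a Callable.
     */
    public Callable<Path> snapshotSync(long checkpointId) throws IOException {
        Path localFile = localDir.resolve("chk-" + checkpointId + ".data");
        Files.write(localFile, buffer);
        buffer.clear();
        // Asynchronous phase: copying to "remote" storage happens off the main thread.
        return () -> Files.copy(
                localFile,
                remoteDir.resolve(localFile.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws Exception {
        Path local = Files.createTempDirectory("local");
        Path remote = Files.createTempDirectory("remote");
        AsyncSnapshotSketch operator = new AsyncSnapshotSketch(local, remote);
        ExecutorService asyncExecutor = Executors.newSingleThreadExecutor();
        try {
            operator.processElement("a");
            operator.processElement("b");
            Callable<Path> upload = operator.snapshotSync(1L);
            Future<Path> pending = asyncExecutor.submit(upload);
            // The main thread keeps processing while the upload runs.
            operator.processElement("c");
            Path uploaded = pending.get();
            System.out.println(Files.readAllLines(uploaded)); // prints [a, b]
        } finally {
            asyncExecutor.shutdown();
        }
    }
}
```

In this sketch the main thread blocks only for the local write; a slow or stalled upload delays completion of the pending future rather than record processing. How failures of the asynchronous part would be reported back to the checkpoint is left out here and is a matter for the actual design.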