[ https://issues.apache.org/jira/browse/FLINK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744702#comment-17744702 ]
Danny Cranmer commented on FLINK-32469:
---------------------------------------

Merged commit [{{7b9b4e5}}|https://github.com/apache/flink/commit/7b9b4e53a59ab8f4f2a99a6e162a794d264f7daf] into master

> Improve checkpoint REST APIs for programmatic access
> ----------------------------------------------------
>
>                 Key: FLINK-32469
>                 URL: https://issues.apache.org/jira/browse/FLINK-32469
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / REST
>    Affects Versions: 1.16.2, 1.17.1
>            Reporter: Hong Liang Teoh
>            Assignee: Hong Liang Teoh
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.18.0
>
>
> *Why*
> We want to enable programmatic use of the checkpoints REST API, independent
> of the Flink dashboard.
> Currently, REST APIs that retrieve information relating to a given Flink job
> pass through the {{ExecutionGraphCache}}. This means that all these APIs may
> return data that is stale by up to the {{web.refresh-interval}}, which
> defaults to 3s. For programmatic use of the REST API, we should be able to
> retrieve either the latest or the cached version, depending on the client
> (the Flink dashboard gets the cached version, other clients get the updated
> version).
> For example, a user might want to use the REST API to retrieve the latest
> completed checkpoint for a given Flink job. This might be useful when trying
> to use existing checkpoints as a state store when migrating a Flink job from
> one cluster to another. See the example use case below.
>
> *What*
> This change moves the checkpoints REST APIs onto their own cache, separate
> from the {{ExecutionGraphCache}}. This way, a user can set the timeout for
> the checkpoints cache to 0s (disabling the cache) without much effect on the
> user experience of the Flink dashboard.
> In addition, the checkpoint handlers currently first retrieve the
> {{ExecutionGraph}}, then retrieve the {{CheckpointStatsSnapshot}} from the
> graph. This is not needed, since the checkpoint handlers only need the
> {{CheckpointStatsSnapshot}}. With this change, these handlers retrieve only
> the minimal required information ({{CheckpointStatsSnapshot}}) to construct
> a reply.
>
> *Example use case*
> When performing security patching / maintenance of the infrastructure
> supporting the Flink cluster, we might want to transfer a given Flink job to
> another cluster, whilst maintaining state. We can do this via the below steps:
> # Old cluster - Select completed checkpoint on existing Flink job
> # Old cluster - Stop the existing Flink job
> # New cluster - Start a new Flink job with selected checkpoint
> Step 1 requires us to query the checkpoints REST API for the latest completed
> checkpoint. With the status quo, we need to wait 3s (or whatever the
> {{ExecutionGraphCache}} expiry may be) to be sure the response is current.
> This is undesirable because it means the Flink job will have to reprocess
> data equivalent to 3s (or whatever the execution graph cache timeout is).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
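
Step 1 of the example use case (selecting the latest completed checkpoint through the REST API) could be sketched roughly as below. This is an illustrative client only, not code from the merged commit. It assumes the JobManager REST endpoint is reachable at a base URL such as http://localhost:8081, that the response of GET /jobs/<jobid>/checkpoints carries the path under latest.completed.external_path, and that the job retains its checkpoints. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def latest_completed_checkpoint_path(stats: dict) -> Optional[str]:
    """Extract the external path of the latest completed checkpoint.

    `stats` is the parsed JSON body of the checkpoint statistics response.
    The field names ("latest", "completed", "external_path") are assumed
    from the REST response layout. Returns None when no checkpoint has
    completed yet or when the path is absent.
    """
    latest = stats.get("latest") or {}
    completed = latest.get("completed")
    if not completed:
        return None
    return completed.get("external_path")


def fetch_checkpoint_stats(base_url: str, job_id: str, timeout: float = 10.0) -> dict:
    """Fetch checkpoint statistics for one job from the REST API.

    `base_url` is the JobManager REST address (hypothetical example:
    "http://localhost:8081"). Raises urllib.error.URLError on failure.
    """
    url = f"{base_url.rstrip('/')}/jobs/{job_id}/checkpoints"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A caller would combine the two, for example latest_completed_checkpoint_path(fetch_checkpoint_stats(base_url, job_id)), then stop the old job and start the new one from the returned path. How fresh the returned checkpoint is depends on the cache timeout discussed in the issue.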