[ https://issues.apache.org/jira/browse/FLINK-36753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17911326#comment-17911326 ]
Samrat Deb edited comment on FLINK-36753 at 1/9/25 5:15 AM:
------------------------------------------------------------

I have looked into the requirements and feasibility of this improvement and had a one-on-one offline discussion with [~fanrui]. In brief, here are the main open questions to consider for actively triggering checkpoints during rescaling:

Open Questions:

1. Should active checkpoint triggering also apply to downscaling? Downscaling does not need to wait for additional resources, but actively triggering a checkpoint would release resources sooner.

2. Should we respect the {code:java} execution.checkpointing.min-pause{code} configuration when actively triggering a checkpoint for rescaling?

My perspective: `execution.checkpointing.min-pause` was introduced so that Flink jobs spend their time processing data rather than being dominated by fault-tolerance activity. Active checkpoint triggering targets scenarios where resources are ready and a completed checkpoint lets the job rescale to a higher parallelism. With higher parallelism, jobs will eventually process data more efficiently. Active triggering during downscaling releases resources much earlier and leads to more efficient resource usage by Flink. Ignoring `execution.checkpointing.min-pause` brings evident performance benefits in such cases.

On the other hand, strictly honoring `execution.checkpointing.min-pause` adheres to the user-defined configuration but may cause delays in situations where active triggering would be beneficial. Should performance gains in these specific scenarios take precedence over adherence to the user-specified checkpointing configuration?

3. If a checkpoint is already in progress and {code:java} execution.checkpointing.max-concurrent-checkpoints{code} allows further concurrent checkpoints, should we use the available capacity to actively start a new checkpoint and speed up rescaling? The alternative is to assume that the in-progress checkpoint will serve the rescale and to take no further action.

[~fanrui] [~mxm] Thoughts?

> Adaptive Scheduler actively triggers a Checkpoint
> -------------------------------------------------
>
>                 Key: FLINK-36753
>                 URL: https://issues.apache.org/jira/browse/FLINK-36753
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 2.0-preview
>            Reporter: Rui Fan
>            Assignee: Samrat Deb
>            Priority: Major
>
> FLIP-461[1] and FLINK-35549[2] allow a rescale to be executed after
> the next completed checkpoint, which greatly reduces the amount of data
> replayed after the rescale.
> In FLIP-461, Adaptive Scheduler waits for the next periodic checkpoint to be
> triggered. In most scenarios, a more efficient solution might be for Adaptive
> Scheduler to actively trigger a checkpoint after all resources are
> ready (technically, once the desired resources are ready).
> The idea comes from an offline discussion between [~mxm] and [~fanrui].
> [1][https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
> [2] https://issues.apache.org/jira/browse/FLINK-35549

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
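
To make the trade-offs in the three open questions of the comment concrete, they can be read as three independent policy switches. The following is a minimal sketch under that assumption only. Every name in it (`ActiveTriggerSketch`, `Policy`, `shouldTrigger`, and all parameters) is hypothetical and not part of Flink or of any proposed patch; it models the decision and does not call any Flink API.

```java
import java.time.Duration;

/** Hypothetical decision sketch; none of these names exist in Flink. */
final class ActiveTriggerSketch {

    /** One switch per open question (1 to 3) in the comment. */
    static final class Policy {
        final boolean triggerOnDownscale;    // Q1: also trigger when scaling down?
        final boolean respectMinPause;       // Q2: honor execution.checkpointing.min-pause?
        final boolean useFreeConcurrentSlot; // Q3: start another checkpoint while one is in flight?

        Policy(boolean triggerOnDownscale, boolean respectMinPause, boolean useFreeConcurrentSlot) {
            this.triggerOnDownscale = triggerOnDownscale;
            this.respectMinPause = respectMinPause;
            this.useFreeConcurrentSlot = useFreeConcurrentSlot;
        }
    }

    /**
     * Decides whether the scheduler would actively trigger a checkpoint now.
     * All inputs are assumed to be supplied by the caller.
     */
    static boolean shouldTrigger(
            Policy policy,
            boolean scalingUp,
            Duration sinceLastCompleted, // time since the last checkpoint completed
            Duration minPause,           // configured min-pause value
            int inFlight,                // checkpoints currently in progress
            int maxConcurrent) {         // configured max-concurrent-checkpoints value
        // Q1: skip downscaling unless the policy opts in.
        if (!scalingUp && !policy.triggerOnDownscale) {
            return false;
        }
        // Q2: if min-pause is honored, wait until it has elapsed.
        if (policy.respectMinPause && sinceLastCompleted.compareTo(minPause) < 0) {
            return false;
        }
        // Nothing in flight: trigger right away.
        if (inFlight == 0) {
            return true;
        }
        // Q3: a checkpoint is running; trigger another only if allowed and capacity remains.
        return policy.useFreeConcurrentSlot && inFlight < maxConcurrent;
    }

    public static void main(String[] args) {
        Policy strict = new Policy(false, true, false);
        Policy lenient = new Policy(true, false, true);
        Duration pause = Duration.ofSeconds(30);
        System.out.println(shouldTrigger(strict, true, Duration.ofSeconds(1), pause, 0, 1));
        System.out.println(shouldTrigger(lenient, false, Duration.ofSeconds(1), pause, 1, 2));
    }
}
```

Under this reading, the "strict" policy corresponds to always deferring to the user's checkpointing configuration, while the "lenient" policy corresponds to favoring a faster rescale. Which combination is appropriate is exactly what the comment leaves open.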