[ 
https://issues.apache.org/jira/browse/FLINK-36753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17911326#comment-17911326
 ] 

Samrat Deb commented on FLINK-36753:
------------------------------------

I looked into the requirements and feasibility of this improvement and 
discussed it offline with [~fanrui]. Below are the main open questions and 
considerations for actively triggering checkpoints during rescaling:

Open Questions:

1. Should active checkpoint triggering also apply to downscaling? Downscaling 
does not need to wait for additional resources, but actively triggering a 
checkpoint would release resources sooner.


2. Should we respect the `execution.checkpointing.min-pause` configuration when 
actively triggering a checkpoint for rescaling?

My perspective:

`execution.checkpointing.min-pause` was introduced so that Flink jobs spend 
their time processing data rather than being dominated by fault-tolerance 
work. Active checkpoint triggering targets the case where the desired 
resources are already available and a completed checkpoint lets the job 
rescale to a higher parallelism. With higher parallelism, the job will 
eventually process data more efficiently.
During downscaling, active triggering releases resources much earlier and 
leads to more efficient resource usage by Flink. Ignoring 
`execution.checkpointing.min-pause` brings a clear performance benefit in 
both cases.
On the other hand, strictly honoring `execution.checkpointing.min-pause` 
respects the user-defined configuration, but it can delay rescaling in 
exactly the situations where active triggering would be beneficial.

Should the performance gain in these specific scenarios take precedence over 
adherence to the user-specified checkpointing configuration?
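
To make the trade-off concrete, here is a minimal sketch of the two policies. 
The class and method names are hypothetical and are not existing Flink API; 
the sketch also assumes that min-pause is measured from the end of the last 
checkpoint to the start of the next one.

```java
import java.time.Duration;

/** Hypothetical helper (not Flink API): how long must a rescale-driven trigger wait? */
public final class RescaleTriggerDelay {

    private RescaleTriggerDelay() {}

    /**
     * @param sinceLastCheckpointEnd time elapsed since the last checkpoint finished
     * @param minPause value of execution.checkpointing.min-pause
     * @param respectMinPause whether the active trigger honors min-pause
     * @return the wait before triggering; zero means "trigger immediately"
     */
    public static Duration delayBeforeTrigger(
            Duration sinceLastCheckpointEnd, Duration minPause, boolean respectMinPause) {
        if (!respectMinPause) {
            // Policy A: ignore min-pause and trigger as soon as resources are ready.
            return Duration.ZERO;
        }
        // Policy B: wait for whatever part of the pause has not yet elapsed.
        Duration remaining = minPause.minus(sinceLastCheckpointEnd);
        return remaining.isNegative() ? Duration.ZERO : remaining;
    }
}
```

For example, with a min-pause of 30 seconds and a checkpoint that finished 10 
seconds ago, respecting the setting delays the rescale by a further 20 
seconds, while ignoring it triggers the checkpoint immediately.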


3. If a checkpoint is already in progress and 
`execution.checkpointing.max-concurrent-checkpoints` allows further concurrent 
checkpoints, should we use the spare capacity to actively trigger a new 
checkpoint and speed up rescaling? The alternative is to assume that the 
in-progress checkpoint will serve the rescale and to take no further action.
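
The two options in question 3 can be sketched as a small decision function. 
This is only an illustration under stated assumptions: the class, enum, and 
parameter names are hypothetical and do not correspond to existing Flink 
code, and `useSpareCapacity` stands for whichever behaviour we agree on.

```java
/** Hypothetical sketch (not Flink API) of the choice described in question 3. */
public final class RescaleCheckpointDecision {

    /** What the scheduler does once the desired resources are ready. */
    public enum Action { TRIGGER_NEW, WAIT_FOR_IN_FLIGHT }

    private RescaleCheckpointDecision() {}

    /**
     * @param inFlightCheckpoints number of checkpoints currently in progress
     * @param maxConcurrentCheckpoints value of execution.checkpointing.max-concurrent-checkpoints
     * @param useSpareCapacity whether a free concurrent slot should be used for rescaling
     */
    public static Action decide(
            int inFlightCheckpoints, int maxConcurrentCheckpoints, boolean useSpareCapacity) {
        if (inFlightCheckpoints == 0) {
            // Nothing is running, so an active trigger is the only way to speed things up.
            return Action.TRIGGER_NEW;
        }
        boolean hasSpareCapacity = inFlightCheckpoints < maxConcurrentCheckpoints;
        return (hasSpareCapacity && useSpareCapacity)
                ? Action.TRIGGER_NEW
                : Action.WAIT_FOR_IN_FLIGHT;
    }
}
```

With one checkpoint in flight and a limit of 2, the outcome depends entirely 
on `useSpareCapacity`; with a limit of 1, the scheduler can only wait for the 
in-flight checkpoint.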

 

[~fanrui] [~mxm] Thoughts?

> Adaptive Scheduler actively triggers a Checkpoint
> -------------------------------------------------
>
>                 Key: FLINK-36753
>                 URL: https://issues.apache.org/jira/browse/FLINK-36753
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 2.0-preview
>            Reporter: Rui Fan
>            Assignee: Samrat Deb
>            Priority: Major
>
> FLIP-461[1] and FLINK-35549[2] allow a rescale to be executed after the 
> next completed checkpoint. This greatly reduces the amount of data replayed 
> after a rescale.
> In FLIP-461, the Adaptive Scheduler waits for the next periodic checkpoint 
> to be triggered. In most scenarios, a more efficient solution might be for 
> the Adaptive Scheduler to actively trigger a checkpoint once all resources 
> are ready (technically, once the desired resources are ready).
> The idea comes from an offline discussion between [~mxm]  and [~fanrui].
> [1][https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
> [2] https://issues.apache.org/jira/browse/FLINK-35549



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
