[jira] [Commented] (FLINK-36733) Don't transition task to RUNNING until the inputs are recovered (UC)

Alexander Fedulov (Jira) Wed, 04 Dec 2024 10:39:11 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903082#comment-17903082
 ]


Alexander Fedulov commented on FLINK-36733:
-------------------------------------------

[~roman] I am working on preparing the 1.19.2 and 1.20.1 releases. Do you have 
anything in progress that you think we can reasonably get into these patch 
releases? Otherwise I would bump the fix versions to 1.19.3 and 1.20.2.

> Don't transition task to RUNNING until the inputs are recovered (UC)
> --------------------------------------------------------------------
>
>                 Key: FLINK-36733
>                 URL: https://issues.apache.org/jira/browse/FLINK-36733
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.20.0, 1.19.1
>            Reporter: Roman Khachatryan
>            Assignee: Roman Khachatryan
>            Priority: Major
>             Fix For: 1.19.2, 1.20.1
>
>
> When recovering from an Unaligned Checkpoint, a task transitions to RUNNING 
> after restoring:
>  # Output channel state
>  # Operator state
>  # Input channel state 
> However, the upstream task(s) might not yet have sent all the recovered 
> buffers; therefore, in case of rescaling, the downstream task must keep the 
> virtual channel infrastructure up ({{RescalingStreamTaskNetworkInput}}).
>
> That means in particular that checkpoints might be triggered by the 
> {{CheckpointCoordinator}} but declined by the downstream task (because 
> {{RescalingStreamTaskNetworkInput}} doesn't support checkpointing).
>  
> In case of long recovery, many declined checkpoints might exhaust some 
> resources, e.g. transaction ID pools in our case.
> It's confusing (for humans and observability tools) to see tasks switched to 
> RUNNING but still not able to checkpoint due to recovery.
>  
> The proposal is to transition the task to RUNNING only after all the inputs 
> are recovered.
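
For illustration only, the difference between the behavior described in the issue and the proposal can be sketched as a small state model. This is a minimal sketch under stated assumptions: the class RecoveryGate, its enums, and its methods are hypothetical names invented for this example and are not Flink's actual classes or API. It only models when the task would report RUNNING and when a checkpoint would be accepted.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model for illustration; not Flink's actual implementation.
class RecoveryGate {
    enum State { INITIALIZING, RUNNING }

    // The three restore steps listed in the issue, plus the extra condition
    // from the proposal (all recovered buffers received from upstream).
    enum Step {
        OUTPUT_CHANNEL_STATE,
        OPERATOR_STATE,
        INPUT_CHANNEL_STATE,
        UPSTREAM_BUFFERS_RECEIVED
    }

    private final Set<Step> done = EnumSet.noneOf(Step.class);

    void complete(Step step) {
        done.add(step);
    }

    // Behavior described in the issue: RUNNING as soon as the three local
    // restore steps are finished, even if upstream buffers are still pending.
    State currentState() {
        Set<Step> local = EnumSet.of(
                Step.OUTPUT_CHANNEL_STATE,
                Step.OPERATOR_STATE,
                Step.INPUT_CHANNEL_STATE);
        return done.containsAll(local) ? State.RUNNING : State.INITIALIZING;
    }

    // Proposed behavior: RUNNING only once every step, including receipt of
    // all recovered upstream buffers, is complete.
    State proposedState() {
        return done.containsAll(EnumSet.allOf(Step.class))
                ? State.RUNNING
                : State.INITIALIZING;
    }

    // In this model a checkpoint is declined until the inputs are recovered.
    boolean canCheckpoint() {
        return done.contains(Step.UPSTREAM_BUFFERS_RECEIVED);
    }

    public static void main(String[] args) {
        RecoveryGate task = new RecoveryGate();
        task.complete(Step.OUTPUT_CHANNEL_STATE);
        task.complete(Step.OPERATOR_STATE);
        task.complete(Step.INPUT_CHANNEL_STATE);
        System.out.println("current:  " + task.currentState());
        System.out.println("proposed: " + task.proposedState());
        System.out.println("can checkpoint: " + task.canCheckpoint());
    }
}
```

In this model the window in which currentState() is RUNNING while canCheckpoint() is false corresponds to the period of declined checkpoints described above; under proposedState() that window does not exist.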



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
