[ https://issues.apache.org/jira/browse/FLINK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903082#comment-17903082 ]
Alexander Fedulov commented on FLINK-36733:
-------------------------------------------

[~roman] I am working on preparing the 1.19.2 and 1.20.1 releases. Do you have anything in progress that you think we can reasonably get into these patch releases? Otherwise I would bump the fix versions to 1.19.3 and 1.20.2.

> Don't transition task to RUNNING until the inputs are recovered (UC)
> --------------------------------------------------------------------
>
>                 Key: FLINK-36733
>                 URL: https://issues.apache.org/jira/browse/FLINK-36733
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.20.0, 1.19.1
>            Reporter: Roman Khachatryan
>            Assignee: Roman Khachatryan
>            Priority: Major
>             Fix For: 1.19.2, 1.20.1
>
>
> When recovering from an Unaligned Checkpoint, a task transitions to RUNNING
> after restoring:
> # Output channel state
> # Operator state
> # Input channel state
> However, the upstream task(s) might not yet send all the recovered buffers;
> therefore, in case of rescaling, downstream task must keep the virtual
> channel infrastructure up ({{RescalingStreamTaskNetworkInput}}).
>
> That means in particular that checkpoints might be triggered by the
> `CheckpointCoordinator` but declined by the downstream task (because
> {{RescalingStreamTaskNetworkInput}} doesn't support checkpointing).
>
> In case of long recovery, many declined checkpoints might exhaust some
> resources, e.g. transaction ID pools in our case.
> It's confusing (for humans and observability tools) to see tasks switched to
> RUNNING but still not able to checkpoint due to recovery.
>
> The proposal is to transition task to RUNNING only after all the inputs are
> recovered.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)