Roman Khachatryan created FLINK-36733: -----------------------------------------
Summary: Don't transition task to RUNNING until the inputs are recovered (UC) Key: FLINK-36733 URL: https://issues.apache.org/jira/browse/FLINK-36733 Project: Flink Issue Type: Improvement Components: Runtime / Task Affects Versions: 1.19.1, 1.20.0 Reporter: Roman Khachatryan Assignee: Roman Khachatryan Fix For: 1.19.2, 1.20.1 When recovering from an Unaligned Checkpoint, a task transitions to RUNNING after restoring: # Output channel state # Operator state # Input channel state However, the upstream task(s) might not yet send all the recovered buffers; therefore, in case of rescaling, downstream task must keep the virtual channel infrastructure up ({{{}RescalingStreamTaskNetworkInput).{}}} {{}} That means in particular that checkpoints might be triggered by the `CheckpointCoordinator` but declined by the downstream task (because {{RescalingStreamTaskNetworkInput}} doesn't support checkpointing). In case of long recovery, many declined checkpoints might exhaust some resources, e.g. transaction ID pools in our case. It's confusing (for humans and observability tools) to see tasks switched to RUNNING but still not able to checkpoint due to recovery. The proposal is to transition task to RUNNING only after all the inputs are recovered. -- This message was sent by Atlassian Jira (v8.20.10#820010)