Roman Khachatryan created FLINK-36733:
-----------------------------------------
Summary: Don't transition task to RUNNING until the inputs are
recovered (UC)
Key: FLINK-36733
URL: https://issues.apache.org/jira/browse/FLINK-36733
Project: Flink
Issue Type: Improvement
Components: Runtime / Task
Affects Versions: 1.19.1, 1.20.0
Reporter: Roman Khachatryan
Assignee: Roman Khachatryan
Fix For: 1.19.2, 1.20.1
When recovering from an Unaligned Checkpoint, a task transitions to RUNNING
after restoring:
# Output channel state
# Operator state
# Input channel state
However, the upstream task(s) might not yet have sent all the recovered buffers;
therefore, in case of rescaling, the downstream task must keep the virtual channel
infrastructure up ({{RescalingStreamTaskNetworkInput}}).
That means in particular that checkpoints might be triggered by the
{{CheckpointCoordinator}} but declined by the downstream task (because
{{RescalingStreamTaskNetworkInput}} doesn't support checkpointing).
In case of long recovery, many declined checkpoints might exhaust some
resources, e.g. transaction ID pools in our case.
It's confusing (for humans and observability tools) to see tasks switched to
RUNNING but still not able to checkpoint due to recovery.
The proposal is to transition the task to RUNNING only after all of its inputs
are recovered.
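A minimal sketch of the proposed ordering, for illustration only. It is not Flink code: the class {{RecoveryGatedTask}}, its {{RECOVERING}} state and the per-input recovery futures are hypothetical names assumed for this example. The idea is that the task stays in a pre-RUNNING state until every input reports that its recovered buffers have been consumed, so a checkpoint is never accepted (and then declined) in between.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration; none of these names are Flink classes or APIs.
final class RecoveryGatedTask {

    // RECOVERING stands in for whatever pre-RUNNING state the task reports.
    enum State { RECOVERING, RUNNING }

    private volatile State state = State.RECOVERING;

    State getState() {
        return state;
    }

    /**
     * Switches the task to RUNNING only after every input's recovery future
     * has completed. Each future is assumed to complete once that input has
     * received all recovered buffers from upstream.
     */
    CompletableFuture<Void> startWhenRecovered(List<CompletableFuture<Void>> inputRecovery) {
        return CompletableFuture
                .allOf(inputRecovery.toArray(new CompletableFuture[0]))
                .thenRun(() -> state = State.RUNNING);
    }

    /** A checkpoint can only be taken once the task is RUNNING. */
    boolean canCheckpoint() {
        return state == State.RUNNING;
    }
}
```

With this ordering, a coordinator that only triggers checkpoints for RUNNING tasks would simply not start one during recovery, instead of starting it and having it declined downstream.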
--
This message was sent by Atlassian Jira
(v8.20.10#820010)