Roman Khachatryan created FLINK-36733:
-----------------------------------------

             Summary: Don't transition task to RUNNING until the inputs are 
recovered (UC)
                 Key: FLINK-36733
                 URL: https://issues.apache.org/jira/browse/FLINK-36733
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Task
    Affects Versions: 1.19.1, 1.20.0
            Reporter: Roman Khachatryan
            Assignee: Roman Khachatryan
             Fix For: 1.19.2, 1.20.1


When recovering from an Unaligned Checkpoint, a task transitions to RUNNING 
after restoring:
 # Output channel state
 # Operator state
 # Input channel state 

However, the upstream task(s) might not yet have sent all the recovered buffers; 
therefore, in case of rescaling, the downstream task must keep the virtual channel 
infrastructure up ({{RescalingStreamTaskNetworkInput}}).


In particular, this means that checkpoints might be triggered by the 
{{CheckpointCoordinator}} but declined by the downstream task (because 
{{RescalingStreamTaskNetworkInput}} doesn't support checkpointing).

 

During a long recovery, the many declined checkpoints might exhaust some 
resources, e.g. transaction ID pools in our case.

It's confusing (for humans and observability tools) to see tasks switched to 
RUNNING but still not able to checkpoint due to recovery.

 

The proposal is to transition the task to RUNNING only after all the inputs are 
recovered.
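
A minimal sketch of the proposed gating, for illustration only. It is not Flink code: the {{Task}} class, the two-value {{State}} enum and the methods {{restoreLocalState}}, {{onInputRecovered}} and {{triggerCheckpoint}} are hypothetical names, and the sketch assumes checkpoints are only attempted once a task reports RUNNING.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative model only; all names here are hypothetical, not Flink APIs.
class RecoveryGateSketch {

    // Simplified task states for this sketch.
    enum State { INITIALIZING, RUNNING }

    static final class Task {
        private volatile State state = State.INITIALIZING;
        private final AtomicInteger pendingInputs;
        // Completes once every input has delivered its recovered buffers.
        private final CompletableFuture<Void> inputsRecovered = new CompletableFuture<>();

        Task(int numInputs) {
            this.pendingInputs = new AtomicInteger(numInputs);
            if (numInputs == 0) {
                inputsRecovered.complete(null);
            }
        }

        // Called after output channel, operator and input channel state are restored.
        // Instead of switching to RUNNING right away, wait for the inputs.
        void restoreLocalState() {
            inputsRecovered.thenRun(() -> state = State.RUNNING);
        }

        // Called when one input has received all recovered buffers from upstream.
        void onInputRecovered() {
            if (pendingInputs.decrementAndGet() == 0) {
                inputsRecovered.complete(null);
            }
        }

        // Assumption: a checkpoint is only attempted against a RUNNING task,
        // so nothing is triggered (and then declined) while inputs recover.
        boolean triggerCheckpoint() {
            return state == State.RUNNING;
        }

        State state() {
            return state;
        }
    }

    public static void main(String[] args) {
        Task task = new Task(2);
        task.restoreLocalState();
        System.out.println("after local restore: " + task.state());
        task.onInputRecovered();
        System.out.println("one input recovered: " + task.state());
        task.onInputRecovered();
        System.out.println("all inputs recovered: " + task.state());
        System.out.println("checkpoint accepted: " + task.triggerCheckpoint());
    }
}
```

With this ordering, the task's reported state matches its ability to checkpoint, so the coordinator and observability tools see a single consistent signal.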



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
