Piotr Nowojski created FLINK-37256:
--------------------------------------

             Summary: Firing timers can block recovery process
                 Key: FLINK-37256
                 URL: https://issues.apache.org/jira/browse/FLINK-37256
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing, Runtime / Task
    Affects Versions: 1.20.0, 2.0.0
            Reporter: Piotr Nowojski


Splittable/interruptible timers for checkpointing were introduced in FLINK-20217 
as part of FLIP-443:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-443%3A+Interruptible+timers+firing

However, the exact same problem can happen during recovery. It is usually (only?) 
caused by a watermark that was captured along with the in-flight data and is 
processed during a subtask's "INITIALIZATION" phase. The problem is that while a 
subtask is in the initialization phase, the job cannot perform any checkpoints. 
The issue is compounded if there is a data-multiplying operator in the 
pipeline, downstream from the operator that has a lot of timers to fire. What 
can happen then is:
* some upstream operator A fires a lot of timers, which produce a lot of 
data (for example 100 000 records), while it is still INITIALIZING
* those records are multiplied downstream (operators B, C, ...) by a factor 
of, for example, 100x
* in the end, the sinks have to accept ~100 000 * 100 records before the upstream 
operator A can finish processing the in-flight data and switch to RUNNING

This can take hours.
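
As a rough back-of-envelope illustration of the numbers above (not a measurement), the sketch below multiplies the timer output by the downstream factor and divides by a sink throughput. The function names and the sink rate of 500 records/s are assumptions made up for this example; only the 100 000 records and the 100x factor come from the description.

```python
def records_at_sink(timer_records: int, fan_out: int) -> int:
    """Records the sinks must accept before operator A can leave INITIALIZING."""
    return timer_records * fan_out


def drain_time_hours(total_records: int, sink_records_per_second: float) -> float:
    """Time for the sinks to absorb total_records at a constant rate."""
    if sink_records_per_second <= 0:
        raise ValueError("sink_records_per_second must be positive")
    return total_records / sink_records_per_second / 3600.0


# Figures from the description: 100 000 timer-produced records, 100x multiplication.
total = records_at_sink(100_000, 100)

# ASSUMPTION: a slow, backpressured sink at 500 records/s (hypothetical value).
hours = drain_time_hours(total, 500.0)
print(f"{total} records, about {hours:.1f} h without any checkpoint")
```

With these assumed inputs the sinks have to absorb 10 000 000 records, which at 500 records/s is about 5.6 hours during which operator A stays in the initialization phase.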



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
