Piotr Nowojski created FLINK-37256: -------------------------------------- Summary: Firing timers can block recovery process Key: FLINK-37256 URL: https://issues.apache.org/jira/browse/FLINK-37256 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing, Runtime / Task Affects Versions: 1.20.0, 2.0.0 Reporter: Piotr Nowojski
Splitable/interruptible timers for checkpointnig were introduced in FLINK-20217 as part of the https://cwiki.apache.org/confluence/display/FLINK/FLIP-443%3A+Interruptible+timers+firing . However the exact same problem can happen during recovery. Usually (only?) due to a watermark that was caught along the in-flight data, that is being processed during a subtask's "INITIALIZATION" phase. The problem is now that while we are in the initialization phase, job can not perform any checkpoints. This issue is compounded if there is some data multiplication operator in the pipeline, downstream from the operator that has a lot of timers to fire. What can happen then is: * some upstream operator A is firing a lot of timers, that produce a lot of data (for example 100 000 records) while it's still INITIALIZING * those records are multiplied downstream (operators B, C, ...) by for example factor of 100x * in the end, sinks have to accept ~100 000 * 100 records before that upstream operator A can finish processing in-flight data and switch to RUNNING This can take hours. -- This message was sent by Atlassian Jira (v8.20.10#820010)