Rafael Zimmermann created FLINK-37393:
-----------------------------------------

Summary: Abnormally Long Checkpoint Duration After Full Checkpoint Completion
Key: FLINK-37393
URL: https://issues.apache.org/jira/browse/FLINK-37393
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.18.0
Environment: Apache Flink 1.18.0
Pipeline processing ~5.7TB of checkpointed data
Using GCS for checkpoint storage
Reporter: Rafael Zimmermann
Attachments: Screenshot 2025-02-03 at 1.52.09 PM.png, Screenshot 2025-02-03 at 1.52.23 PM.png, Screenshot 2025-02-03 at 1.52.30 PM.png, evidence.log

We're observing an issue where checkpoints following a full checkpoint take an unusually long time to complete (1-2 hours) in our Flink pipeline, while normal checkpoints typically complete within seconds/minutes.

### Observed Behavior:
- After a full checkpoint completes, the next incremental checkpoint shows extremely high start delay
- Normal checkpoints take ~30 seconds to complete
- Full checkpoints take ~7 minutes to complete
- The problematic checkpoint after full checkpoint takes 1-2 hours
- The start delay appears to be the main contributing factor to the long duration

### Logs and Evidence:
Full checkpoint logs showing significant gaps:
{code:java}
{"instant":{"epochSecond":1738600670,"nanoOfSecond":603000000},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}
...
{"instant":{"epochSecond":1738600713,"nanoOfSecond":266000000},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}
{code}

### Impact:
- Affects pipeline reliability
- No data loss observed, but creates operational concerns

### Potential Causes:
- Possible locking mechanism bug in Flink internals
- Issues with queued checkpoint requests
- Interaction between full and incremental checkpoint scheduling

### Attempted Workarounds:
- Manually triggering checkpoints
- Normal checkpoint operations resume after several hours

### Questions:
Is this expected behavior when full and incremental checkpoints interact?
Are there known issues with checkpoint request queuing?
Are there configuration parameters that could help mitigate this?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
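
As a sanity check on the log excerpt quoted in this report, the gap between the two timestamps can be compared with the queue time that CheckpointRequestDecider prints. The sketch below is an illustration only and is not taken from evidence.log. It copies just the `instant` and `message` fields of the two quoted records, the helper names are hypothetical, and it assumes the reported queue time is in milliseconds.

```python
import json

# The two log records quoted in the report, trimmed to the fields used here.
TRIGGER = (
    '{"instant":{"epochSecond":1738600670,"nanoOfSecond":603000000},'
    '"message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}'
)
QUEUED = (
    '{"instant":{"epochSecond":1738600713,"nanoOfSecond":266000000},'
    '"message":"checkpoint request time in queue: 42663"}'
)


def instant_millis(record: dict) -> int:
    """Convert the record's epochSecond/nanoOfSecond pair to epoch milliseconds."""
    instant = record["instant"]
    return instant["epochSecond"] * 1000 + instant["nanoOfSecond"] // 1_000_000


def reported_queue_millis(record: dict) -> int:
    """Read the trailing number of the 'time in queue' message (assumed to be ms)."""
    return int(record["message"].rsplit(":", 1)[1])


trigger = json.loads(TRIGGER)
queued = json.loads(QUEUED)

# Wall-clock gap between the trigger message and the decider's message.
gap_ms = instant_millis(queued) - instant_millis(trigger)

print(gap_ms)
print(reported_queue_millis(queued))
```

Both values come out as 42663, so under the milliseconds assumption the roughly 42.7 seconds between the two quoted log lines matches the time the request reports having spent in the queue.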