[ https://issues.apache.org/jira/browse/FLINK-37393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rafael Zimmermann updated FLINK-37393:
--------------------------------------
Description:

We're observing an issue where checkpoints following a full checkpoint take an unusually long time to complete (1-2 hours) in our Flink pipeline, while normal checkpoints typically complete within seconds to minutes.

h4. Observed Behavior:
- After a full checkpoint completes, the next incremental checkpoint shows an extremely high start delay
- Normal checkpoints take ~30 seconds to complete
- Full checkpoints take ~7 minutes to complete
- The problematic checkpoint immediately after a full checkpoint takes 1-2 hours
- The start delay appears to be the main contributing factor to the long duration

h4. Logs and Evidence:
Full checkpoint logs showing a significant gap: the CheckpointRequestDecider message below arrives ~42.7 seconds after the trigger message, matching the reported queue time of 42663 ms.
{code:java}
{"instant":{"epochSecond":1738600670,"nanoOfSecond":603000000},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}
...
{"instant":{"epochSecond":1738600713,"nanoOfSecond":266000000},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}
{code}

h4. Impact:
- Affects pipeline reliability
- No data loss observed, but creates operational concerns

h4. Potential Causes:
- Possible locking mechanism bug in Flink internals
- Issues with queued checkpoint requests
- Interaction between full and incremental checkpoint scheduling

h4. Attempted Workarounds:
- Manually triggering checkpoints
- Normal checkpoint operation resumes on its own after several hours

h4. Questions:
Is this expected behavior when full and incremental checkpoints interact?
Are there known issues with checkpoint request queuing?
Are there configuration parameters that could help mitigate this?
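For context on the last question, below is a minimal sketch of the shape of our checkpointing setup. It is not our exact production configuration: the class name {{CheckpointConfigSketch}}, the interval/pause/timeout values, and the {{gs://<bucket>/checkpoints}} path are illustrative placeholders; only the structure (incremental RocksDB checkpoints on GCS, with the settings that influence checkpoint request queuing) reflects the job.
{code:java}
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental RocksDB checkpoints (the boolean enables incremental mode);
        // full checkpoints are triggered separately in our setup.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Periodic checkpointing; the interval is an illustrative value.
        env.enableCheckpointing(60_000L); // trigger a checkpoint every 60 s

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // Checkpoints go to GCS (placeholder bucket name).
        checkpointConfig.setCheckpointStorage("gs://<bucket>/checkpoints");

        // These two settings influence whether CheckpointRequestDecider triggers a
        // periodic request immediately or puts it in the queue.
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000L); // at least 30 s between checkpoints
        checkpointConfig.setMaxConcurrentCheckpoints(1);         // requests queue behind the in-flight checkpoint

        // Abort checkpoints that run longer than 30 minutes (illustrative value).
        checkpointConfig.setCheckpointTimeout(30L * 60_000L);
    }
}
{code}
With a single concurrent checkpoint allowed, a periodic request that arrives while the full checkpoint is still running should only sit in the queue for roughly the remainder of that full checkpoint plus the min-pause (~7 minutes here), which makes the 1-2 hour start delays after a full checkpoint all the more surprising.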
{"instant":{"epochSecond":1738600713,"nanoOfSecond":266000000},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}{code} ### Impact: - Affects pipeline reliability - No data loss observed, but creates operational concerns ### Potential Causes: - Possible locking mechanism bug in Flink internals - Issues with queued checkpoint requests - Interaction between full and incremental checkpoint scheduling ### Attempted Workarounds: - Manually triggering checkpoints - Normal checkpoint operations resume after several hours ### Questions: Is this expected behavior when full and incremental checkpoints interact? Are there known issues with checkpoint request queuing? Are there configuration parameters that could help mitigate this? > Abnormally Long Checkpoint Duration After Full Checkpoint Completion > -------------------------------------------------------------------- > > Key: FLINK-37393 > URL: https://issues.apache.org/jira/browse/FLINK-37393 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing > Affects Versions: 1.18.0 > Environment: Apache Flink 1.18.0 > Pipeline processing ~5.7TB of checkpointed data > Using GCS for checkpoint storage > Reporter: Rafael Zimmermann > Priority: Major > Attachments: Screenshot 2025-02-03 at 1.52.09 PM.png, Screenshot > 2025-02-03 at 1.52.23 PM.png, Screenshot 2025-02-03 at 1.52.30 PM.png, > evidence.log > > > We're observing an issue where checkpoints following a full checkpoint take > an unusually long time to complete (1-2 hours) in our Flink pipeline, while > normal checkpoints typically complete within seconds/minutes. > h4. Observed Behavior: > - After a full checkpoint completes, the next incremental checkpoint shows > extremely high start delay > - Normal checkpoints take ~30 seconds to complete > - Full checkpoints take ~7 minutes to complete > - The problematic checkpoint after full checkpoint takes 1-2 hours > - The start delay appears to be the main contributing factor to the long > duration > h4. Logs and Evidence: > Full checkpoint logs showing significant gaps: > {code:java} > {"instant":{"epochSecond":1738600670,"nanoOfSecond":603000000},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering > a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."} > ... > {"instant":{"epochSecond":1738600713,"nanoOfSecond":266000000},"thread":"Checkpoint > > Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint > request time in queue: 42663"}{code} > > h4. Impact: > - Affects pipeline reliability > - No data loss observed, but creates operational concerns > h4. Potential Causes: > - Possible locking mechanism bug in Flink internals > - Issues with queued checkpoint requests > - Interaction between full and incremental checkpoint scheduling > h4. Attempted Workarounds: > - Manually triggering checkpoints > - Normal checkpoint operations resume after several hours > h4. Questions: > Is this expected behavior when full and incremental checkpoints interact? > Are there known issues with checkpoint request queuing? > Are there configuration parameters that could help mitigate this? > -- This message was sent by Atlassian Jira (v8.20.10#820010)