[ https://issues.apache.org/jira/browse/FLINK-37393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rafael Zimmermann updated FLINK-37393:
--------------------------------------
Description:

We're observing an issue where checkpoints following a full checkpoint take an unusually long time to complete (1-2 hours) in our Flink pipeline, while normal checkpoints typically complete within seconds to minutes.

h4. Observed Behavior:
- After a full checkpoint completes, the next incremental checkpoint shows an extremely high start delay
- Normal checkpoints take ~30 seconds to complete
- Full checkpoints take ~7 minutes to complete
- The problematic checkpoint immediately after a full checkpoint takes 1-2 hours
- The start delay appears to be the main contributing factor to the long duration

h4. Logs and Evidence:
Full checkpoint logs showing a significant gap: the CheckpointRequestDecider message below arrives ~42.7 seconds after the trigger message, matching the reported queue time of 42663 ms.
{code:java}
{"instant":{"epochSecond":1738600670,"nanoOfSecond":603000000},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}
...
{"instant":{"epochSecond":1738600713,"nanoOfSecond":266000000},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}
{code}

h4. Impact:
- Affects pipeline reliability
- No data loss observed, but creates operational concerns

h4. Potential Causes:
- Possible locking mechanism bug in Flink internals
- Issues with queued checkpoint requests
- Interaction between full and incremental checkpoint scheduling

h4. Attempted Workarounds:
- Manually triggering checkpoints
- Normal checkpoint operation resumes on its own after several hours

h4. Questions:
Is this expected behavior when full and incremental checkpoints interact?
Are there known issues with checkpoint request queuing?
Are there configuration parameters that could help mitigate this?
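For context on the last question, below is a minimal sketch of the shape of our checkpointing setup. It is not our exact production configuration: the class name {{CheckpointConfigSketch}}, the interval/pause/timeout values, and the {{gs://<bucket>/checkpoints}} path are illustrative placeholders; only the structure (incremental RocksDB checkpoints on GCS, with the settings that influence checkpoint request queuing) reflects the job.
{code:java}
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental RocksDB checkpoints (the boolean enables incremental mode);
        // full checkpoints are triggered separately in our setup.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Periodic checkpointing; the interval is an illustrative value.
        env.enableCheckpointing(60_000L); // trigger a checkpoint every 60 s

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // Checkpoints go to GCS (placeholder bucket name).
        checkpointConfig.setCheckpointStorage("gs://<bucket>/checkpoints");

        // These two settings influence whether CheckpointRequestDecider triggers a
        // periodic request immediately or puts it in the queue.
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000L); // at least 30 s between checkpoints
        checkpointConfig.setMaxConcurrentCheckpoints(1);         // requests queue behind the in-flight checkpoint

        // Abort checkpoints that run longer than 30 minutes (illustrative value).
        checkpointConfig.setCheckpointTimeout(30L * 60_000L);
    }
}
{code}
With a single concurrent checkpoint allowed, a periodic request that arrives while the full checkpoint is still running should only sit in the queue for roughly the remainder of that full checkpoint plus the min-pause (~7 minutes here), which makes the 1-2 hour start delays after a full checkpoint all the more surprising.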
{"instant":{"epochSecond":1738600713,"nanoOfSecond":266000000},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}{code} ### Impact: - Affects pipeline reliability - No data loss observed, but creates operational concerns ### Potential Causes: - Possible locking mechanism bug in Flink internals - Issues with queued checkpoint requests - Interaction between full and incremental checkpoint scheduling ### Attempted Workarounds: - Manually triggering checkpoints - Normal checkpoint operations resume after several hours ### Questions: Is this expected behavior when full and incremental checkpoints interact? Are there known issues with checkpoint request queuing? Are there configuration parameters that could help mitigate this? > Abnormally Long Checkpoint Duration After Full Checkpoint Completion > -------------------------------------------------------------------- > > Key: FLINK-37393 > URL: https://issues.apache.org/jira/browse/FLINK-37393 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing > Affects Versions: 1.18.0 > Environment: Apache Flink 1.18.0 > Pipeline processing ~5.7TB of checkpointed data > Using GCS for checkpoint storage > Reporter: Rafael Zimmermann > Priority: Major > Attachments: Screenshot 2025-02-03 at 1.52.09 PM.png, Screenshot > 2025-02-03 at 1.52.23 PM.png, Screenshot 2025-02-03 at 1.52.30 PM.png, > evidence.log > > > We're observing an issue where checkpoints following a full checkpoint take > an unusually long time to complete (1-2 hours) in our Flink pipeline, while > normal checkpoints typically complete within seconds/minutes. > h4. Observed Behavior: > - After a full checkpoint completes, the next incremental checkpoint shows > extremely high start delay > - Normal checkpoints take ~30 seconds to complete > - Full checkpoints take ~7 minutes to complete > - The problematic checkpoint after full checkpoint takes 1-2 hours > - The start delay appears to be the main contributing factor to the long > duration > h4. Logs and Evidence: > Full checkpoint logs showing significant gaps: > {code:java} > {"instant":{"epochSecond":1738600670,"nanoOfSecond":603000000},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering > a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."} > ... > {"instant":{"epochSecond":1738600713,"nanoOfSecond":266000000},"thread":"Checkpoint > > Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint > request time in queue: 42663"}{code} > > h4. Impact: > - Affects pipeline reliability > - No data loss observed, but creates operational concerns > h4. Potential Causes: > - Possible locking mechanism bug in Flink internals > - Issues with queued checkpoint requests > - Interaction between full and incremental checkpoint scheduling > h4. Attempted Workarounds: > - Manually triggering checkpoints > - Normal checkpoint operations resume after several hours > h4. Questions: > Is this expected behavior when full and incremental checkpoints interact? > Are there known issues with checkpoint request queuing? > Are there configuration parameters that could help mitigate this? > -- This message was sent by Atlassian Jira (v8.20.10#820010)