[jira] [Commented] (FLINK-37383) ThrottleIterator from examples/utils does not properly throttle on next window after throttle is applied
[ https://issues.apache.org/jira/browse/FLINK-37383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931786#comment-17931786 ]

Rafael Zimmermann commented on FLINK-37383:
-------------------------------------------

I've addressed all the review comments on the PR and am waiting for another review. With this fix applied, we now have precise control over the throttling of our Flink pipeline.

> ThrottleIterator from examples/utils does not properly throttle on next
> window after throttle is applied
> -----------------------------------------------------------------------
>
>                 Key: FLINK-37383
>                 URL: https://issues.apache.org/jira/browse/FLINK-37383
>             Project: Flink
>          Issue Type: Bug
>          Components: Examples
>    Affects Versions: 1.20.1
>            Reporter: Rafael Zimmermann
>            Priority: Minor
>              Labels: pull-request-available
>
> The throttle function in
> `flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/utils/ThrottledIterator.java`
> updates its last batch check time before the sleep operation, causing it
> to underestimate the elapsed time and allow approximately double the
> intended throughput rate.
> PR [#26203] contains a fix.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
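The timing bug described in this issue can be sketched as follows. This is a hypothetical, simplified throttler, not the actual `ThrottledIterator` source: the clock is injected and the sleep duration is returned instead of performed, so the window accounting can be checked deterministically. It shows the fixed behavior implied by the report, where the next window starts after the sleep:

```java
import java.util.function.LongSupplier;

// Hypothetical simplified throttler (NOT the real ThrottledIterator): after each
// batch of records it computes how long to sleep so that one batch maps to one
// time window.
class WindowThrottler {
    private final long batchSize;     // records allowed per window
    private final long windowMillis;  // window length in milliseconds
    private final LongSupplier clock; // injectable clock, for deterministic tests
    private long windowStart;
    private long count;

    WindowThrottler(long batchSize, long windowMillis, LongSupplier clock) {
        this.batchSize = batchSize;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Records one element; returns the millis the caller should sleep (0 = none). */
    long recordAndComputeSleep() {
        if (++count < batchSize) {
            return 0;
        }
        count = 0;
        long elapsed = clock.getAsLong() - windowStart;
        long sleep = Math.max(0, windowMillis - elapsed);
        // Fixed accounting per the issue: the next window starts AFTER the sleep.
        // Taking the timestamp before the sleep would credit the sleep time to
        // the next window, letting every other window skip its sleep entirely.
        windowStart = clock.getAsLong() + sleep;
        return sleep;
    }
}
```

With the pre-sleep timestamp, the second window's elapsed time would already include the previous window's sleep, so the sleep would be skipped on alternating windows — which matches the roughly doubled throughput the report describes.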
[jira] [Created] (FLINK-37393) Abnormally Long Checkpoint Duration After Full Checkpoint Completion
Rafael Zimmermann created FLINK-37393:
-----------------------------------------

             Summary: Abnormally Long Checkpoint Duration After Full Checkpoint Completion
                 Key: FLINK-37393
                 URL: https://issues.apache.org/jira/browse/FLINK-37393
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.18.0
         Environment: Apache Flink 1.18.0
                      Pipeline processing ~5.7TB of checkpointed data
                      Using GCS for checkpoint storage
            Reporter: Rafael Zimmermann
         Attachments: Screenshot 2025-02-03 at 1.52.09 PM.png, Screenshot 2025-02-03 at 1.52.23 PM.png, Screenshot 2025-02-03 at 1.52.30 PM.png, evidence.log

We're observing an issue where checkpoints following a full checkpoint take an unusually long time to complete (1-2 hours) in our Flink pipeline, while normal checkpoints typically complete within seconds to minutes.

### Observed Behavior:
- After a full checkpoint completes, the next incremental checkpoint shows an extremely high start delay
- Normal checkpoints take ~30 seconds to complete
- Full checkpoints take ~7 minutes to complete
- The problematic checkpoint after a full checkpoint takes 1-2 hours
- The start delay appears to be the main contributor to the long duration

### Logs and Evidence:
Full checkpoint logs showing significant gaps:
{code:java}
{"instant":{"epochSecond":1738600670,"nanoOfSecond":60300},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}
...
{"instant":{"epochSecond":1738600713,"nanoOfSecond":26600},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}
{code}

### Impact:
- Affects pipeline reliability
- No data loss observed, but creates operational concerns

### Potential Causes:
- Possible locking bug in Flink internals
- Issues with queued checkpoint requests
- Interaction between full and incremental checkpoint scheduling

### Attempted Workarounds:
- Manually triggering checkpoints
- Normal checkpoint operations resume after several hours

### Questions:
- Is this expected behavior when full and incremental checkpoints interact?
- Are there known issues with checkpoint request queuing?
- Are there configuration parameters that could help mitigate this?
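On the configuration question at the end of the report: the options below are standard Flink checkpointing settings, but treating them as a mitigation for the post-full-checkpoint delay described here is an assumption, not a confirmed fix — the underlying queuing behavior would still need a proper diagnosis:

```yaml
# Sketch only: real Flink checkpointing options, but whether they mitigate this
# specific delay is an assumption.
execution.checkpointing.interval: 5 min
# Enforce breathing room after a slow full checkpoint finishes, so the next
# request is not queued immediately behind it:
execution.checkpointing.min-pause: 2 min
# Fail checkpoints that hang instead of letting them run for 1-2 hours:
execution.checkpointing.timeout: 30 min
# Only one checkpoint in flight at a time keeps the request queue shallow:
execution.checkpointing.max-concurrent-checkpoints: 1
```

The `checkpoint request time in queue: 42663` log line above (≈42.6 s, matching the gap between the two epochSecond timestamps) suggests the request decider is the place where the delay accumulates, which is why the min-pause and concurrency limits are the knobs worth trying first.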
[jira] [Updated] (FLINK-37393) Abnormally Long Checkpoint Duration After Full Checkpoint Completion
[ https://issues.apache.org/jira/browse/FLINK-37393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rafael Zimmermann updated FLINK-37393:
--------------------------------------
    Description: 
We're observing an issue where checkpoints following a full checkpoint take an unusually long time to complete (1-2 hours) in our Flink pipeline, while normal checkpoints typically complete within seconds to minutes.

h4. Observed Behavior:
- After a full checkpoint completes, the next incremental checkpoint shows an extremely high start delay
- Normal checkpoints take ~30 seconds to complete
- Full checkpoints take ~7 minutes to complete
- The problematic checkpoint after a full checkpoint takes 1-2 hours
- The start delay appears to be the main contributor to the long duration

h4. Logs and Evidence:
Full checkpoint logs showing significant gaps:
{code:java}
{"instant":{"epochSecond":1738600670,"nanoOfSecond":60300},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}
...
{"instant":{"epochSecond":1738600713,"nanoOfSecond":26600},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}
{code}

h4. Impact:
- Affects pipeline reliability
- No data loss observed, but creates operational concerns

h4. Potential Causes:
- Possible locking bug in Flink internals
- Issues with queued checkpoint requests
- Interaction between full and incremental checkpoint scheduling

h4. Attempted Workarounds:
- Manually triggering checkpoints
- Normal checkpoint operations resume after several hours

h4. Questions:
- Is this expected behavior when full and incremental checkpoints interact?
- Are there known issues with checkpoint request queuing?
- Are there configuration parameters that could help mitigate this?
  was:
We're observing an issue where checkpoints following a full checkpoint take an unusually long time to complete (1-2 hours) in our Flink pipeline, while normal checkpoints typically complete within seconds to minutes.

### Observed Behavior:
- After a full checkpoint completes, the next incremental checkpoint shows an extremely high start delay
- Normal checkpoints take ~30 seconds to complete
- Full checkpoints take ~7 minutes to complete
- The problematic checkpoint after a full checkpoint takes 1-2 hours
- The start delay appears to be the main contributor to the long duration

### Logs and Evidence:
Full checkpoint logs showing significant gaps:
{code:java}
{"instant":{"epochSecond":1738600670,"nanoOfSecond":60300},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}
...
{"instant":{"epochSecond":1738600713,"nanoOfSecond":26600},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}
{code}

### Impact:
- Affects pipeline reliability
- No data loss observed, but creates operational concerns

### Potential Causes:
- Possible locking bug in Flink internals
- Issues with queued checkpoint requests
- Interaction between full and incremental checkpoint scheduling

### Attempted Workarounds:
- Manually triggering checkpoints
- Normal checkpoint operations resume after several hours

### Questions:
- Is this expected behavior when full and incremental checkpoints interact?
- Are there known issues with checkpoint request queuing?
- Are there configuration parameters that could help mitigate this?
> Abnormally Long Checkpoint Duration After Full Checkpoint Completion
> ---------------------------------------------------------------------
>
>                 Key: FLINK-37393
>                 URL: https://issues.apache.org/jira/browse/FLINK-37393
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.0
>         Environment: Apache Flink 1.18.0
>                      Pipeline processing ~5.7TB of checkpointed data
>                      Using GCS for checkpoint storage
>            Reporter: Rafael Zimmermann
>            Priority: Major
>         Attachments: Screenshot 2025-02-03 at 1.52.09 PM.png, Screenshot
>                      2025-02-03 at 1.52.23 PM.png, Screenshot 2025-02-03 at
>                      1.52.30 PM.png, evidence.log
>
> We're observing an issue where checkpoints following a full checkpoint take
> an unusually long time to complete (1-2 hours) in our Flink pipeline, while
> normal checkpoints typically complete within seconds/minutes.
> h4. Observed Behavior:
> - After a full checkpoint completes, the next incremental checkpoint shows
> extre
[jira] [Created] (FLINK-37383) ThrottleIterator from examples/utils does not properly throttle on next window after throttle is applied
Rafael Zimmermann created FLINK-37383:
-----------------------------------------

             Summary: ThrottleIterator from examples/utils does not properly throttle on next window after throttle is applied
                 Key: FLINK-37383
                 URL: https://issues.apache.org/jira/browse/FLINK-37383
             Project: Flink
          Issue Type: Bug
          Components: Examples
    Affects Versions: 1.20.1
            Reporter: Rafael Zimmermann

The throttle function in `flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/utils/ThrottledIterator.java` updates its last batch check time before the sleep operation, causing it to underestimate the elapsed time and allow approximately double the intended throughput rate.

PR [#26203](https://github.com/apache/flink/pull/26203) contains a fix.
[jira] [Updated] (FLINK-37383) ThrottleIterator from examples/utils does not properly throttle on next window after throttle is applied
[ https://issues.apache.org/jira/browse/FLINK-37383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rafael Zimmermann updated FLINK-37383:
--------------------------------------
    Description: 
The throttle function in `flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/utils/ThrottledIterator.java` updates its last batch check time before the sleep operation, causing it to underestimate the elapsed time and allow approximately double the intended throughput rate.

PR [#26203] contains a fix.

  was:
The throttle function in `flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/utils/ThrottledIterator.java` updates its last batch check time before the sleep operation, causing it to underestimate the elapsed time and allow approximately double the intended throughput rate.

PR [#26203](https://github.com/apache/flink/pull/26203) contains a fix.

> ThrottleIterator from examples/utils does not properly throttle on next
> window after throttle is applied
> -----------------------------------------------------------------------
>
>                 Key: FLINK-37383
>                 URL: https://issues.apache.org/jira/browse/FLINK-37383
>             Project: Flink
>          Issue Type: Bug
>          Components: Examples
>    Affects Versions: 1.20.1
>            Reporter: Rafael Zimmermann
>            Priority: Minor
>              Labels: pull-request-available
>
> The throttle function in
> `flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/utils/ThrottledIterator.java`
> updates its last batch check time before the sleep operation, causing it
> to underestimate the elapsed time and allow approximately double the
> intended throughput rate.
> PR [#26203] contains a fix.