[jira] [Commented] (FLINK-37383) ThrottleIterator from examples/utils does not properly throttle on next window after throttle is applied

2025-03-01 Thread Rafael Zimmermann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-37383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931786#comment-17931786
 ] 

Rafael Zimmermann commented on FLINK-37383:
---

I've addressed all comments on the PR and am waiting for review. With this 
fix applied, we now have precise control over the throttling of our Flink 
pipeline.

> ThrottleIterator from examples/utils does not properly throttle on next 
> window after throttle is applied
> 
>
> Key: FLINK-37383
> URL: https://issues.apache.org/jira/browse/FLINK-37383
> Project: Flink
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.20.1
>Reporter: Rafael Zimmermann
>Priority: Minor
>  Labels: pull-request-available
>
> The throttle function available on 
> `flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/utils/ThrottledIterator.java`
>  is updating its last batch check time before the sleep operation, causing it 
> to underestimate the elapsed time and allow approximately double the intended 
> throughput rate.
> [##26203] contains a fix for it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-37393) Abnormally Long Checkpoint Duration After Full Checkpoint Completion

2025-02-26 Thread Rafael Zimmermann (Jira)
Rafael Zimmermann created FLINK-37393:
-

 Summary: Abnormally Long Checkpoint Duration After Full Checkpoint 
Completion
 Key: FLINK-37393
 URL: https://issues.apache.org/jira/browse/FLINK-37393
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.18.0
 Environment: Apache Flink 1.18.0
Pipeline processing ~5.7TB of checkpointed data
Using GCS for checkpoint storage
Reporter: Rafael Zimmermann
 Attachments: Screenshot 2025-02-03 at 1.52.09 PM.png, Screenshot 
2025-02-03 at 1.52.23 PM.png, Screenshot 2025-02-03 at 1.52.30 PM.png, 
evidence.log

We're observing an issue where checkpoints following a full checkpoint take an 
unusually long time to complete (1-2 hours) in our Flink pipeline, while normal 
checkpoints typically complete within seconds/minutes.



### Observed Behavior:

- After a full checkpoint completes, the next incremental checkpoint shows 
extremely high start delay
- Normal checkpoints take ~30 seconds to complete
- Full checkpoints take ~7 minutes to complete
- The problematic checkpoint after full checkpoint takes 1-2 hours
- The start delay appears to be the main contributing factor to the long 
duration

### Logs and Evidence:

Full checkpoint logs showing significant gaps:


{code:java}
{"instant":{"epochSecond":1738600670,"nanoOfSecond":60300},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}
...
{"instant":{"epochSecond":1738600713,"nanoOfSecond":26600},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}
{code}
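
The gap between the two log entries above can be checked by subtracting their timestamps. This is a minimal sketch, assuming each `instant` field is an epoch second plus a nanosecond adjustment; the `LogGap` class and `gapMillis` helper are hypothetical names for illustration and are not part of Flink.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Flink: measures the wall-clock gap
// between two log entries given their "instant" fields.
class LogGap {
    static long gapMillis(long sec1, long nanos1, long sec2, long nanos2) {
        Instant first = Instant.ofEpochSecond(sec1, nanos1);
        Instant second = Instant.ofEpochSecond(sec2, nanos2);
        return Duration.between(first, second).toMillis();
    }

    public static void main(String[] args) {
        // Values copied from the two log entries quoted above.
        long gap = gapMillis(1738600670L, 60300L, 1738600713L, 26600L);
        System.out.println("gap between log entries (ms): " + gap);
    }
}
```

The gap comes out just under 43 seconds. If the queue-time entry refers to the same request, the reported 42663 ms in the queue accounts for nearly all of it.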
 

### Impact:

- Affects pipeline reliability
- No data loss observed, but creates operational concerns



### Potential Causes:

- Possible locking mechanism bug in Flink internals
- Issues with queued checkpoint requests
- Interaction between full and incremental checkpoint scheduling

### Attempted Workarounds:

- Manually triggering checkpoints
- Normal checkpoint operations resume after several hours

### Questions:

Is this expected behavior when full and incremental checkpoints interact?
Are there known issues with checkpoint request queuing?
Are there configuration parameters that could help mitigate this?

 





[jira] [Updated] (FLINK-37393) Abnormally Long Checkpoint Duration After Full Checkpoint Completion

2025-02-26 Thread Rafael Zimmermann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-37393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafael Zimmermann updated FLINK-37393:
--
Description: 
We're observing an issue where checkpoints following a full checkpoint take an 
unusually long time to complete (1-2 hours) in our Flink pipeline, while normal 
checkpoints typically complete within seconds/minutes.
h4. Observed Behavior:
 - After a full checkpoint completes, the next incremental checkpoint shows 
extremely high start delay
 - Normal checkpoints take ~30 seconds to complete
 - Full checkpoints take ~7 minutes to complete
 - The problematic checkpoint after full checkpoint takes 1-2 hours
 - The start delay appears to be the main contributing factor to the long 
duration

h4. Logs and Evidence:

Full checkpoint logs showing significant gaps:
{code:java}
{"instant":{"epochSecond":1738600670,"nanoOfSecond":60300},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}
...
{"instant":{"epochSecond":1738600713,"nanoOfSecond":26600},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}
{code}
 
h4. Impact:
 - Affects pipeline reliability
 - No data loss observed, but creates operational concerns

h4. Potential Causes:
 - Possible locking mechanism bug in Flink internals
 - Issues with queued checkpoint requests
 - Interaction between full and incremental checkpoint scheduling

h4. Attempted Workarounds:
 - Manually triggering checkpoints
 - Normal checkpoint operations resume after several hours

h4. Questions:

Is this expected behavior when full and incremental checkpoints interact?
Are there known issues with checkpoint request queuing?
Are there configuration parameters that could help mitigate this?

 

  was:
We're observing an issue where checkpoints following a full checkpoint take an 
unusually long time to complete (1-2 hours) in our Flink pipeline, while normal 
checkpoints typically complete within seconds/minutes.



### Observed Behavior:

- After a full checkpoint completes, the next incremental checkpoint shows 
extremely high start delay
- Normal checkpoints take ~30 seconds to complete
- Full checkpoints take ~7 minutes to complete
- The problematic checkpoint after full checkpoint takes 1-2 hours
- The start delay appears to be the main contributing factor to the long 
duration

### Logs and Evidence:

Full checkpoint logs showing significant gaps:


{code:java}
{"instant":{"epochSecond":1738600670,"nanoOfSecond":60300},"thread":"flink-pekko.actor.default-dispatcher-18","level":"INFO","loggerName":"org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler","message":"Triggering a checkpoint for job 36762c0137f9ed6a9d5e9ce3dc933871."}
...
{"instant":{"epochSecond":1738600713,"nanoOfSecond":26600},"thread":"Checkpoint Timer","level":"INFO","loggerName":"org.apache.flink.runtime.checkpoint.CheckpointRequestDecider","message":"checkpoint request time in queue: 42663"}
{code}
 

### Impact:

- Affects pipeline reliability
- No data loss observed, but creates operational concerns



### Potential Causes:

- Possible locking mechanism bug in Flink internals
- Issues with queued checkpoint requests
- Interaction between full and incremental checkpoint scheduling

### Attempted Workarounds:

- Manually triggering checkpoints
- Normal checkpoint operations resume after several hours

### Questions:

Is this expected behavior when full and incremental checkpoints interact?
Are there known issues with checkpoint request queuing?
Are there configuration parameters that could help mitigate this?

 


> Abnormally Long Checkpoint Duration After Full Checkpoint Completion
> 
>
> Key: FLINK-37393
> URL: https://issues.apache.org/jira/browse/FLINK-37393
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.18.0
> Environment: Apache Flink 1.18.0
> Pipeline processing ~5.7TB of checkpointed data
> Using GCS for checkpoint storage
>Reporter: Rafael Zimmermann
>Priority: Major
> Attachments: Screenshot 2025-02-03 at 1.52.09 PM.png, Screenshot 
> 2025-02-03 at 1.52.23 PM.png, Screenshot 2025-02-03 at 1.52.30 PM.png, 
> evidence.log
>
>
> We're observing an issue where checkpoints following a full checkpoint take 
> an unusually long time to complete (1-2 hours) in our Flink pipeline, while 
> normal checkpoints typically complete within seconds/minutes.
> h4. Observed Behavior:
>  - After a full checkpoint completes, the next incremental checkpoint shows 
> extre

[jira] [Created] (FLINK-37383) ThrottleIterator from examples/utils does not properly throttle on next window after throttle is applied

2025-02-25 Thread Rafael Zimmermann (Jira)
Rafael Zimmermann created FLINK-37383:
-

 Summary: ThrottleIterator from examples/utils does not properly 
throttle on next window after throttle is applied
 Key: FLINK-37383
 URL: https://issues.apache.org/jira/browse/FLINK-37383
 Project: Flink
  Issue Type: Bug
  Components: Examples
Affects Versions: 1.20.1
Reporter: Rafael Zimmermann


The throttle function in 
`flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/utils/ThrottledIterator.java`
 updates its last batch check time before the sleep operation, causing it 
to underestimate the elapsed time and allow approximately double the intended 
throughput rate.

[#26203](https://github.com/apache/flink/pull/26203) contains a fix for it.
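
The effect described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the Flink class and not necessarily the change made in the PR: `ThrottleSketch`, `simulate`, and `stampAfterSleep` are hypothetical names, a fake clock stands in for `System.currentTimeMillis()` and `Thread.sleep()`, and emitting elements is assumed to take no time. In this sketch, keeping the pre-sleep timestamp means the next window counts the sleep as time already spent, so every other window skips its sleep.

```java
// Hypothetical simulation, not the Flink class: a fake clock replaces
// System.currentTimeMillis() and Thread.sleep() so the result is deterministic.
class ThrottleSketch {

    /**
     * Returns the simulated time (ms) needed to release `batches` batches
     * when each batch is supposed to take at least `batchTimeMs`.
     * Emitting elements is assumed to take no time at all.
     */
    static long simulate(int batches, long batchTimeMs, boolean stampAfterSleep) {
        long clock = 0;          // fake wall clock in ms
        long lastCheck = clock;  // time of the last batch check
        for (int i = 0; i < batches; i++) {
            long now = clock;
            long elapsed = now - lastCheck;
            if (elapsed < batchTimeMs) {
                clock += batchTimeMs - elapsed; // stands in for Thread.sleep
            }
            // The buggy variant keeps the pre-sleep timestamp; the other
            // variant re-reads the clock after sleeping.
            lastCheck = stampAfterSleep ? clock : now;
        }
        return clock;
    }

    public static void main(String[] args) {
        System.out.println("stamp before sleep: " + simulate(20, 50, false) + " ms");
        System.out.println("stamp after sleep:  " + simulate(20, 50, true) + " ms");
    }
}
```

With 20 batches of 50 ms each, the first variant finishes in 500 ms and the second in 1000 ms, which matches the roughly doubled throughput described in the report.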





[jira] [Updated] (FLINK-37383) ThrottleIterator from examples/utils does not properly throttle on next window after throttle is applied

2025-02-25 Thread Rafael Zimmermann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-37383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafael Zimmermann updated FLINK-37383:
--
Description: 
The throttle function available on 
`flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/utils/ThrottledIterator.java`
 is updating its last batch check time before the sleep operation, causing it 
to underestimate the elapsed time and allow approximately double the intended 
throughput rate.

[##26203] contains a fix for it

  was:
The throttle function available on 
`flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/utils/ThrottledIterator.java`
 is updating its last batch check time before the sleep operation, causing it 
to underestimate the elapsed time and allow approximately double the intended 
throughput rate.

[##26203](https://github.com/apache/flink/pull/26203) contains a fix for it


> ThrottleIterator from examples/utils does not properly throttle on next 
> window after throttle is applied
> 
>
> Key: FLINK-37383
> URL: https://issues.apache.org/jira/browse/FLINK-37383
> Project: Flink
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.20.1
>Reporter: Rafael Zimmermann
>Priority: Minor
>  Labels: pull-request-available
>
> The throttle function available on 
> `flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/utils/ThrottledIterator.java`
>  is updating its last batch check time before the sleep operation, causing it 
> to underestimate the elapsed time and allow approximately double the intended 
> throughput rate.
> [##26203] contains a fix for it


