[jira] [Created] (FLINK-25958) OOME Checkpoints & Savepoints were shown as COMPLETE in Flink UI

2022-02-04 Thread Victor Xu (Jira)
Victor Xu created FLINK-25958:
-

 Summary: OOME Checkpoints & Savepoints were shown as COMPLETE in 
Flink UI
 Key: FLINK-25958
 URL: https://issues.apache.org/jira/browse/FLINK-25958
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.13.5
 Environment: Ververica Platform 2.6.2

Flink 1.13.5
Reporter: Victor Xu
 Attachments: JIRA-1.jpg

Flink job was running but the checkpoints & savepoints were failing all the 
time due to OOM Exception. However, the Flink UI showed COMPLETE for those 
checkpoints & savepoints.

For example (checkpoint 39 & 40):
{noformat}
2022-01-27 02:41:39,969 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 39 (type=CHECKPOINT) @ 1643251299952 for job 
ab2217e5ce144087bbddf6bd6c3
668eb.
2022-01-27 02:43:19,678 WARN  org.apache.flink.runtime.jobmaster.JobMaster      
           [] - Error while processing AcknowledgeCheckpoint message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the 
pending checkpoint 39. Failure reason: Failure to finalize checkpoint.
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227)
 ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072)
 ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
        at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
 ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
        at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
 ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-s
tream2]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 [?:?]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.IllegalArgumentException: Self-suppression not permitted
        at java.lang.Throwable.addSuppressed(Throwable.java:1054) ~[?:?]
        at 
org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:627)
 ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
        at 
com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.serializeCheckpoint(KubernetesHaCheckpointStore.java:204)
 ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.09
1138-2.jar:?]
        at 
com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.addCheckpoint(KubernetesHaCheckpointStore.java:83)
 ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.
jar:?]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209)
 ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
        ... 9 more
Caused by: java.lang.OutOfMemoryError: Java heap space
2022-01-27 03:41:39,970 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 40 (type=CHECKPOINT) @ 1643254899952 for job 
ab2217e5ce144087bbddf6bd6c3
668eb.
2022-01-27 03:43:22,326 WARN  org.apache.flink.runtime.jobmaster.JobMaster      
           [] - Error while processing AcknowledgeCheckpoint message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the 
pending checkpoint 40. Failure reason: Failure to finalize checkpoint.
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227)
 ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072)
 ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
        at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
 ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
        at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
 ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-s
tream2]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu

[jira] [Created] (FLINK-24432) RocksIteratorWrapper.seekToLast() calls the wrong RocksIterator method

2021-09-30 Thread Victor Xu (Jira)
Victor Xu created FLINK-24432:
-

 Summary: RocksIteratorWrapper.seekToLast() calls the wrong 
RocksIterator method
 Key: FLINK-24432
 URL: https://issues.apache.org/jira/browse/FLINK-24432
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.14.0
Reporter: Victor Xu


The RocksIteratorWrapper is a wrapper of RocksIterator to do additional status 
check for all the methods. However, there's a typo that 
RocksIteratorWrapper.*seekToLast*() method calls RocksIterator's 
*seekToFirst*(), which is obviously wrong. I guess this issue wasn't found 
before as it was only referenced in the 
RocksTransformingIteratorWrapper.seekToLast() method and nowhere else.
{code:java}
@Override
public void seekToFirst() {
 iterator.seekToFirst();
 status();
}

@Override
public void seekToLast() {
 iterator.seekToFirst();
 status();
}{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24460) Rocksdb Iterator Error Handling Improvement

2021-10-06 Thread Victor Xu (Jira)
Victor Xu created FLINK-24460:
-

 Summary: Rocksdb Iterator Error Handling Improvement
 Key: FLINK-24460
 URL: https://issues.apache.org/jira/browse/FLINK-24460
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Affects Versions: 1.14.0
Reporter: Victor Xu


In FLINK-9373, we introduced RocksIteratorWrapper which was a wrapper around 
RocksIterator to check the iterator status for all the methods. At that time, 
it was required because the iterator may pass the blocks or files it had 
difficulties in reading (because of IO errors, data corruptions, or other 
issues) and continue with the next available keys. *The status flag may not be 
OK, even if the iterator is valid.*

However, the above behaviour changed after 
[3810|https://github.com/facebook/rocksdb/pull/3810] was merged on May 17, 2018:

 *- If the iterator is valid, the status() is guaranteed to be OK;*

 *- If the iterator is not valid, there are two possibilities:*

    *1) We have reached the end of the data. And in this case, status() is OK;*

    *2) There is an error. In this case, status() is not OK;*

More information can be found here: 
https://github.com/facebook/rocksdb/wiki/Iterator#error-handling

Thus, it should be safe to proceed with other operations (e.g. seek, next, 
seekToFirst, seekToLast, seekForPrev, and prev) without checking status(). And 
we only need to check the status if the iterator is invalid. After the change, 
there will be less status() native calls and could theoretically improve 
performance.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)