[jira] [Created] (FLINK-25958) OOME Checkpoints & Savepoints were shown as COMPLETE in Flink UI
Victor Xu created FLINK-25958:

Summary: OOME Checkpoints & Savepoints were shown as COMPLETE in Flink UI
Key: FLINK-25958
URL: https://issues.apache.org/jira/browse/FLINK-25958
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.13.5
Environment: Ververica Platform 2.6.2, Flink 1.13.5
Reporter: Victor Xu
Attachments: JIRA-1.jpg

The Flink job was running, but its checkpoints and savepoints kept failing with an OutOfMemoryError. However, the Flink UI showed those checkpoints and savepoints as COMPLETE. For example (checkpoints 39 and 40):

{noformat}
2022-01-27 02:41:39,969 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 39 (type=CHECKPOINT) @ 1643251299952 for job ab2217e5ce144087bbddf6bd6c3668eb.
2022-01-27 02:43:19,678 WARN org.apache.flink.runtime.jobmaster.JobMaster [] - Error while processing AcknowledgeCheckpoint message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the pending checkpoint 39. Failure reason: Failure to finalize checkpoint.
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
    at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
    at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.IllegalArgumentException: Self-suppression not permitted
    at java.lang.Throwable.addSuppressed(Throwable.java:1054) ~[?:?]
    at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:627) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
    at com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.serializeCheckpoint(KubernetesHaCheckpointStore.java:204) ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
    at com.ververica.platform.flink.ha.kubernetes.KubernetesHaCheckpointStore.addCheckpoint(KubernetesHaCheckpointStore.java:83) ~[vvp-flink-ha-kubernetes-flink113-1.4-20211013.091138-2.jar:?]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
    ... 9 more
Caused by: java.lang.OutOfMemoryError: Java heap space
2022-01-27 03:41:39,970 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 40 (type=CHECKPOINT) @ 1643254899952 for job ab2217e5ce144087bbddf6bd6c3668eb.
2022-01-27 03:43:22,326 WARN org.apache.flink.runtime.jobmaster.JobMaster [] - Error while processing AcknowledgeCheckpoint message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the pending checkpoint 40. Failure reason: Failure to finalize checkpoint.
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
    at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
    at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[flink-dist_2.12-1.13.5-stream2.jar:1.13.5-stream2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
[jira] [Created] (FLINK-24432) RocksIteratorWrapper.seekToLast() calls the wrong RocksIterator method
Victor Xu created FLINK-24432:

Summary: RocksIteratorWrapper.seekToLast() calls the wrong RocksIterator method
Key: FLINK-24432
URL: https://issues.apache.org/jira/browse/FLINK-24432
Project: Flink
Issue Type: Bug
Components: Runtime / State Backends
Affects Versions: 1.14.0
Reporter: Victor Xu

RocksIteratorWrapper wraps RocksIterator and performs an additional status check in every method. However, there is a typo: RocksIteratorWrapper.*seekToLast*() calls RocksIterator's *seekToFirst*() instead of seekToLast(), so the wrapper positions the iterator at the first entry rather than the last. I guess this issue wasn't found earlier because the method is only referenced in RocksTransformingIteratorWrapper.seekToLast() and nowhere else.

{code:java}
@Override
public void seekToFirst() {
    iterator.seekToFirst();
    status();
}

@Override
public void seekToLast() {
    iterator.seekToFirst();
    status();
}
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
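For illustration only, here is a minimal sketch of what the corrected delegation might look like. It does not use the real RocksDB or Flink classes: KeyIterator, InMemoryIterator, and Wrapper are hypothetical stand-ins invented for this example, and the status check is left empty.

```java
import java.util.Arrays;
import java.util.List;

public class SeekToLastSketch {

    /** Hypothetical stand-in for the wrapped iterator (not the real RocksIterator). */
    interface KeyIterator {
        void seekToFirst();
        void seekToLast();
        String key();
    }

    /** In-memory iterator over a sorted key list, used only for this illustration. */
    static final class InMemoryIterator implements KeyIterator {
        private final List<String> keys;
        private int pos = -1;

        InMemoryIterator(List<String> keys) {
            this.keys = keys;
        }

        public void seekToFirst() { pos = keys.isEmpty() ? -1 : 0; }
        public void seekToLast() { pos = keys.size() - 1; }
        public String key() { return keys.get(pos); }
    }

    /** Wrapper in which each method delegates to the method of the same name. */
    static final class Wrapper implements KeyIterator {
        private final KeyIterator iterator;

        Wrapper(KeyIterator iterator) {
            this.iterator = iterator;
        }

        public void seekToFirst() {
            iterator.seekToFirst();
            status();
        }

        public void seekToLast() {
            iterator.seekToLast(); // delegates to seekToLast(), not seekToFirst()
            status();
        }

        public String key() { return iterator.key(); }

        /** Status check elided in this sketch. */
        private void status() { }
    }

    public static void main(String[] args) {
        KeyIterator it = new Wrapper(new InMemoryIterator(Arrays.asList("a", "b", "c")));
        it.seekToLast();
        System.out.println(it.key()); // prints "c"
    }
}
```

With the typo from the report in place, the same call sequence would leave the iterator on "a" instead of "c".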
[jira] [Created] (FLINK-24460) Rocksdb Iterator Error Handling Improvement
Victor Xu created FLINK-24460:

Summary: Rocksdb Iterator Error Handling Improvement
Key: FLINK-24460
URL: https://issues.apache.org/jira/browse/FLINK-24460
Project: Flink
Issue Type: Improvement
Components: Runtime / State Backends
Affects Versions: 1.14.0
Reporter: Victor Xu

In FLINK-9373, we introduced RocksIteratorWrapper, a wrapper around RocksIterator that checks the iterator status in every method. At that time this was required, because the iterator could skip over blocks or files it had difficulty reading (because of IO errors, data corruption, or other issues) and continue with the next available keys. *The status flag may not be OK, even if the iterator is valid.*

However, this behaviour changed after [3810|https://github.com/facebook/rocksdb/pull/3810] was merged on May 17, 2018:

*- If the iterator is valid, status() is guaranteed to be OK;*
*- If the iterator is not valid, there are two possibilities:*
*1) We have reached the end of the data. In this case, status() is OK;*
*2) There is an error. In this case, status() is not OK;*

More information can be found here: https://github.com/facebook/rocksdb/wiki/Iterator#error-handling

Thus, it should be safe to proceed with other operations (e.g. seek, next, seekToFirst, seekToLast, seekForPrev, and prev) without checking status(). We only need to check the status when the iterator is invalid. After the change, there will be fewer native status() calls, which could theoretically improve performance.
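As a rough sketch of the proposed pattern, the loop below checks status() only once the iterator has become invalid. It does not use the real RocksDB API: StatusIterator and FakeIterator are hypothetical stand-ins, and status() is assumed here to signal an error by throwing an unchecked IllegalStateException.

```java
public class IteratorStatusSketch {

    /** Hypothetical stand-in for the iterator methods discussed above (not the real RocksIterator). */
    interface StatusIterator {
        boolean isValid();
        void next();
        void status(); // assumed to throw if the iterator stopped because of an error
    }

    /** Fake iterator that yields a fixed number of entries, then ends cleanly or with an error. */
    static final class FakeIterator implements StatusIterator {
        private final int size;
        private final boolean failAtEnd;
        private int pos = 0;
        int statusCalls = 0; // counts how often status() was invoked

        FakeIterator(int size, boolean failAtEnd) {
            this.size = size;
            this.failAtEnd = failAtEnd;
        }

        public boolean isValid() { return pos < size; }

        public void next() { pos++; }

        public void status() {
            statusCalls++;
            if (failAtEnd && pos >= size) {
                throw new IllegalStateException("simulated read error");
            }
        }
    }

    /** Counts entries, calling status() only after the iterator reports it is no longer valid. */
    static int countEntries(StatusIterator it) {
        int n = 0;
        while (it.isValid()) {
            n++;
            it.next(); // no status() call while the iterator is valid
        }
        it.status(); // invalid: either end of data (OK) or an error (throws)
        return n;
    }

    public static void main(String[] args) {
        FakeIterator ok = new FakeIterator(3, false);
        int entries = countEntries(ok);
        System.out.println(entries + " entries, " + ok.statusCalls + " status() call(s)");
    }
}
```

In this sketch a scan over three entries makes one status() call; checking after every operation would make one call per step.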