Hi Yanfei, We were never able to restore from a checkpoint, we ended up restoring from a savepoint as fallback. Would those logs suggest we failed to take a checkpoint before the job manager restarted? Our observabillity monitors showed no failed checkpoints.
Here is an exception that occurred before the failure to restore from the checkpoint: java.io.IOException: Cannot register Closeable, registry is already closed. Closing argument. at org.apache.flink.util.AbstractAutoCloseableRegistry.registerCloseable(AbstractAutoCloseableRegistry.java:89) ~[a-pipeline-name.jar:1.0] at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:128) ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8] at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:110) ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8] at org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49) ~[a-pipeline-name.jar:1.0] at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) ~[?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?] at java.lang.Thread.run(Thread.java:829) [?:?] Thanks, Jacqlyn On Thu, Sep 7, 2023 at 7:42 PM Yanfei Lei <fredia...@gmail.com> wrote: > Hey Jacqlyn, > According to the stack trace, it seems that there is a problem when > the checkpoint is triggered. Is this the problem after the restore? > would you like to share some logs related to restoring? > > Best, > Yanfei > > Jacqlyn Bender via user <user@flink.apache.org> 于2023年9月8日周五 05:11写道: > > > > Hey folks, > > > > > > We experienced a pipeline failure where our job manager restarted and we > were for some reason unable to restore from our last successful checkpoint. > We had regularly completed checkpoints every 10 minutes up to this failure > and 0 failed checkpoints logged. Using Flink version 1.17.1. > > > > > > Wondering if anyone can shed light on what might have happened? > > > > > > Here's the error from our logs: > > > > > > Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught > exception. Stopping the process... > > > > > > extendedStackTrace: java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: java.lang.NullPointerException > > > > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669) > ~[a-pipeline-name.jar:1.0] > > > > at > java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) > ~[?:?] > > > > at > java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) > ~[?:?] > > > > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) > [?:?] > > > > at > java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) > [?:?] > > > > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910) > [?:?] > > > > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) > [?:?] > > > > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > [?:?] > > > > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] > > > > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > [?:?] > > > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > > > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > > > > at java.lang.Thread.run(Thread.java:829) [?:?] > > > > Caused by: java.util.concurrent.CompletionException: > java.lang.NullPointerException > > > > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) > ~[?:?] > > > > at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) > ~[?:?] > > > > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) > ~[?:?] > > > > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) > ~[?:?] > > > > ... 7 more > > > > Caused by: java.lang.NullPointerException > > > > at > org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399) > ~[a-pipeline-name.jar:1.0] > > > > at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?] > > > > at > java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085) > ~[?:?] > > > > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947) > ~[a-pipeline-name.jar:1.0] > > > > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:923) > ~[a-pipeline-name.jar:1.0] > > > > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:655) > ~[a-pipeline-name.jar:1.0] > > > > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) > ~[?:?] > > > > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) > ~[?:?] > > > > ... 7 more > > > > >