Hello, Just a shot in the dark here, but could it be related to https://issues.apache.org/jira/browse/FLINK-32241 ?
Such failures can cause many exceptions, but I think the ones you've included aren't pointing to the root cause, so I'm not sure if that issue applies to you. Regards, Alexis. On Fri, 8 Sept 2023, 20:43 Jacqlyn Bender via user, <user@flink.apache.org> wrote: > Hi Yanfei, > > We were never able to restore from a checkpoint, we ended up restoring > from a savepoint as fallback. Would those logs suggest we failed to take a > checkpoint before the job manager restarted? Our observabillity monitors > showed no failed checkpoints. > > Here is an exception that occurred before the failure to restore from the > checkpoint: > > java.io.IOException: Cannot register Closeable, registry is already > closed. Closing argument. > > at > org.apache.flink.util.AbstractAutoCloseableRegistry.registerCloseable(AbstractAutoCloseableRegistry.java:89) > ~[a-pipeline-name.jar:1.0] > > at > org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:128) > ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8] > > at > org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:110) > ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8] > > at > org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49) > ~[a-pipeline-name.jar:1.0] > > at > java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) > ~[?:?] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > ~[?:?] > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > ~[?:?] > > at java.lang.Thread.run(Thread.java:829) [?:?] > > Thanks, > Jacqlyn > > On Thu, Sep 7, 2023 at 7:42 PM Yanfei Lei <fredia...@gmail.com> wrote: > >> Hey Jacqlyn, >> According to the stack trace, it seems that there is a problem when >> the checkpoint is triggered. Is this the problem after the restore? >> would you like to share some logs related to restoring? >> >> Best, >> Yanfei >> >> Jacqlyn Bender via user <user@flink.apache.org> 于2023年9月8日周五 05:11写道: >> > >> > Hey folks, >> > >> > >> > We experienced a pipeline failure where our job manager restarted and >> we were for some reason unable to restore from our last successful >> checkpoint. We had regularly completed checkpoints every 10 minutes up to >> this failure and 0 failed checkpoints logged. Using Flink version 1.17.1. >> > >> > >> > Wondering if anyone can shed light on what might have happened? >> > >> > >> > Here's the error from our logs: >> > >> > >> > Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught >> exception. Stopping the process... >> > >> > >> > extendedStackTrace: java.util.concurrent.CompletionException: >> java.util.concurrent.CompletionException: java.lang.NullPointerException >> > >> > at >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669) >> ~[a-pipeline-name.jar:1.0] >> > >> > at >> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) >> ~[?:?] >> > >> > at >> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) >> ~[?:?] >> > >> > at >> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) >> [?:?] >> > >> > at >> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) >> [?:?] >> > >> > at >> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910) >> [?:?] >> > >> > at >> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) >> [?:?] >> > >> > at >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) >> [?:?] >> > >> > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] >> > >> > at >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) >> [?:?] >> > >> > at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) >> [?:?] >> > >> > at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) >> [?:?] >> > >> > at java.lang.Thread.run(Thread.java:829) [?:?] >> > >> > Caused by: java.util.concurrent.CompletionException: >> java.lang.NullPointerException >> > >> > at >> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) >> ~[?:?] >> > >> > at >> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) >> ~[?:?] >> > >> > at >> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) >> ~[?:?] >> > >> > at >> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) >> ~[?:?] >> > >> > ... 7 more >> > >> > Caused by: java.lang.NullPointerException >> > >> > at >> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399) >> ~[a-pipeline-name.jar:1.0] >> > >> > at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?] >> > >> > at >> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085) >> ~[?:?] >> > >> > at >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947) >> ~[a-pipeline-name.jar:1.0] >> > >> > at >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:923) >> ~[a-pipeline-name.jar:1.0] >> > >> > at >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:655) >> ~[a-pipeline-name.jar:1.0] >> > >> > at >> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) >> ~[?:?] >> > >> > at >> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) >> ~[?:?] >> > >> > ... 7 more >> > >> > >> >