Re: Failure to restore from last completed checkpoint

Jacqlyn Bender via user Fri, 08 Sep 2023 11:43:10 -0700

Hi Yanfei,

We were never able to restore from a checkpoint, we ended up restoring from
a savepoint as fallback. Would those logs suggest we failed to take a
checkpoint before the job manager restarted? Our observabillity monitors
showed no failed checkpoints.


Here is an exception that occurred before the failure to restore from the
checkpoint:

java.io.IOException: Cannot register Closeable, registry is already closed.
Closing argument.

at
org.apache.flink.util.AbstractAutoCloseableRegistry.registerCloseable(AbstractAutoCloseableRegistry.java:89)
~[a-pipeline-name.jar:1.0]

at
org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:128)
~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8]

at
org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:110)
~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8]

at
org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
~[a-pipeline-name.jar:1.0]

at
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
~[?:?] at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
~[?:?]

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
~[?:?]

at java.lang.Thread.run(Thread.java:829) [?:?]

Thanks,
Jacqlyn

On Thu, Sep 7, 2023 at 7:42 PM Yanfei Lei <fredia...@gmail.com> wrote:

> Hey Jacqlyn,
> According to the stack trace, it seems that there is a problem when
> the checkpoint is triggered. Is this the problem after the restore?
> would you like to share some logs related to restoring?
>
> Best,
> Yanfei
>
> Jacqlyn Bender via user <user@flink.apache.org> 于2023年9月8日周五 05:11写道：
> >
> > Hey folks,
> >
> >
> > We experienced a pipeline failure where our job manager restarted and we
> were for some reason unable to restore from our last successful checkpoint.
> We had regularly completed checkpoints every 10 minutes up to this failure
> and 0 failed checkpoints logged. Using Flink version 1.17.1.
> >
> >
> > Wondering if anyone can shed light on what might have happened?
> >
> >
> > Here's the error from our logs:
> >
> >
> > Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught
> exception. Stopping the process...
> >
> >
> > extendedStackTrace: java.util.concurrent.CompletionException:
> java.util.concurrent.CompletionException: java.lang.NullPointerException
> >
> > at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669)
> ~[a-pipeline-name.jar:1.0]
> >
> > at
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
> ~[?:?]
> >
> > at
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
> ~[?:?]
> >
> > at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
> [?:?]
> >
> > at
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
> [?:?]
> >
> > at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
> [?:?]
> >
> > at
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
> [?:?]
> >
> > at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> [?:?]
> >
> > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
> >
> > at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> [?:?]
> >
> > at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> [?:?]
> >
> > at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> [?:?]
> >
> > at java.lang.Thread.run(Thread.java:829) [?:?]
> >
> > Caused by: java.util.concurrent.CompletionException:
> java.lang.NullPointerException
> >
> > at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
> ~[?:?]
> >
> > at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
> ~[?:?]
> >
> > at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
> ~[?:?]
> >
> > at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
> ~[?:?]
> >
> > ... 7 more
> >
> > Caused by: java.lang.NullPointerException
> >
> > at
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399)
> ~[a-pipeline-name.jar:1.0]
> >
> > at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
> >
> > at
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
> ~[?:?]
> >
> > at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947)
> ~[a-pipeline-name.jar:1.0]
> >
> > at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:923)
> ~[a-pipeline-name.jar:1.0]
> >
> > at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:655)
> ~[a-pipeline-name.jar:1.0]
> >
> > at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
> ~[?:?]
> >
> > at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
> ~[?:?]
> >
> > ... 7 more
> >
> >
>

Re: Failure to restore from last completed checkpoint

Reply via email to