Re: Failure to restore from last completed checkpoint

Alexis Sarda-Espinosa Fri, 08 Sep 2023 15:55:57 -0700

Hello,

Just a shot in the dark here, but could it be related to
https://issues.apache.org/jira/browse/FLINK-32241 ?


Such failures can cause many exceptions, but I think the ones you've
included aren't pointing to the root cause, so I'm not sure if that issue
applies to you.

Regards,
Alexis.

On Fri, 8 Sept 2023, 20:43 Jacqlyn Bender via user, <user@flink.apache.org>
wrote:

> Hi Yanfei,
>
> We were never able to restore from a checkpoint, we ended up restoring
> from a savepoint as fallback. Would those logs suggest we failed to take a
> checkpoint before the job manager restarted? Our observabillity monitors
> showed no failed checkpoints.
>
> Here is an exception that occurred before the failure to restore from the
> checkpoint:
>
> java.io.IOException: Cannot register Closeable, registry is already
> closed. Closing argument.
>
> at
> org.apache.flink.util.AbstractAutoCloseableRegistry.registerCloseable(AbstractAutoCloseableRegistry.java:89)
> ~[a-pipeline-name.jar:1.0]
>
> at
> org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:128)
> ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8]
>
> at
> org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:110)
> ~[flink-dist-1.17.1-shopify-81a88f8.jar:1.17.1-shopify-81a88f8]
>
> at
> org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
> ~[a-pipeline-name.jar:1.0]
>
> at
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
> ~[?:?] at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> ~[?:?]
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> ~[?:?]
>
> at java.lang.Thread.run(Thread.java:829) [?:?]
>
> Thanks,
> Jacqlyn
>
> On Thu, Sep 7, 2023 at 7:42 PM Yanfei Lei <fredia...@gmail.com> wrote:
>
>> Hey Jacqlyn,
>> According to the stack trace, it seems that there is a problem when
>> the checkpoint is triggered. Is this the problem after the restore?
>> would you like to share some logs related to restoring?
>>
>> Best,
>> Yanfei
>>
>> Jacqlyn Bender via user <user@flink.apache.org> 于2023年9月8日周五 05:11写道：
>> >
>> > Hey folks,
>> >
>> >
>> > We experienced a pipeline failure where our job manager restarted and
>> we were for some reason unable to restore from our last successful
>> checkpoint. We had regularly completed checkpoints every 10 minutes up to
>> this failure and 0 failed checkpoints logged. Using Flink version 1.17.1.
>> >
>> >
>> > Wondering if anyone can shed light on what might have happened?
>> >
>> >
>> > Here's the error from our logs:
>> >
>> >
>> > Message: FATAL: Thread ‘Checkpoint Timer’ produced an uncaught
>> exception. Stopping the process...
>> >
>> >
>> > extendedStackTrace: java.util.concurrent.CompletionException:
>> java.util.concurrent.CompletionException: java.lang.NullPointerException
>> >
>> > at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:669)
>> ~[a-pipeline-name.jar:1.0]
>> >
>> > at
>> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>> ~[?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>> ~[?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>> [?:?]
>> >
>> > at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>> >
>> > at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>> [?:?]
>> >
>> > at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>> [?:?]
>> >
>> > at java.lang.Thread.run(Thread.java:829) [?:?]
>> >
>> > Caused by: java.util.concurrent.CompletionException:
>> java.lang.NullPointerException
>> >
>> > at
>> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>> ~[?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
>> ~[?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
>> ~[?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>> ~[?:?]
>> >
>> > ... 7 more
>> >
>> > Caused by: java.lang.NullPointerException
>> >
>> > at
>> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:399)
>> ~[a-pipeline-name.jar:1.0]
>> >
>> > at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
>> >
>> > at
>> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
>> ~[?:?]
>> >
>> > at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:947)
>> ~[a-pipeline-name.jar:1.0]
>> >
>> > at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:923)
>> ~[a-pipeline-name.jar:1.0]
>> >
>> > at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:655)
>> ~[a-pipeline-name.jar:1.0]
>> >
>> > at
>> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
>> ~[?:?]
>> >
>> > at
>> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>> ~[?:?]
>> >
>> > ... 7 more
>> >
>> >
>>
>

Re: Failure to restore from last completed checkpoint

Reply via email to