We have also enabled unaligned checkpoints. Could it be because of that? We
were experiencing slowness and intermittent packet loss when this issue
occurred.

On Wed, Jul 31, 2024 at 7:43 PM Dhruv Patel <dhruv...@gmail.com> wrote:

> Hi Everyone,
>
>     We are observing an interesting issue with continuous checkpoint
> failures in our job, causing events not to be forwarded through the
> pipeline. We saw the log below repeated in all our task manager instances.
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: The
> checkpoint was aborted due to exception of other subtasks sharing the
> ChannelState file.
>
> Can you help us understand what the issue might be? It does not go away
> until we restart the jobs.
>
>
> Our setup is something like this.
>
> Flink Version: 1.18.1 (Session Mode)
> We use RocksDB without incremental checkpoint
> State size: ~ 2GB
> No memory issues in TMs
>
> Checkpointing frequency: Every 1 minute
>
>
> 2024-07-30 12:16:49.043 [AsyncOperations-thread-83] INFO  o.a.flink.streaming.runtime.tasks.AsyncCheckpointRunnable  - KeyedProcess -> GroupRelationBuilderStream (23/60)#0 - asynchronous part of checkpoint 29749 could not be completed.
> java.util.concurrent.ExecutionException: org.apache.flink.runtime.checkpoint.CheckpointException: The checkpoint was aborted due to exception of other subtasks sharing the ChannelState file.
>          at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>          at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>          at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:66)
>          at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
>          at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
>          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>          at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: The checkpoint was aborted due to exception of other subtasks sharing the ChannelState file.
>          at org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.fail(ChannelStateCheckpointWriter.java:298)
>          at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.failAndClearWriter(ChannelStateWriteRequestDispatcherImpl.java:212)
>          at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.handleCheckpointAbortRequest(ChannelStateWriteRequestDispatcherImpl.java:189)
>          at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatchInternal(ChannelStateWriteRequestDispatcherImpl.java:129)
>          at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatch(ChannelStateWriteRequestDispatcherImpl.java:94)
>          at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.loop(ChannelStateWriteRequestExecutorImpl.java:172)
>          at org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.run(ChannelStateWriteRequestExecutorImpl.java:127)
>          ... 1 common frames omitted
> Caused by: java.util.concurrent.CancellationException: checkpoint aborted via notification
>          at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpoint(SubtaskCheckpointCoordinatorImpl.java:456)
>          at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpointAborted(SubtaskCheckpointCoordinatorImpl.java:410)
>          at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointAbortAsync$18(StreamTask.java:1406)
>          at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointOperation$20(StreamTask.java:1429)
>          at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
>          at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
>          at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:398)
>          at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:367)
>          at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:352)
>          at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
>          at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:858)
>          at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:807)
>          at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:953)
>          at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:932)
>          at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:746)
>          at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
>          ... 1 common frames omitted
>
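
A note for anyone reading the quoted trace: the outer ExecutionException only reports that CompletableFuture.get() failed, and the innermost "Caused by" (checkpoint aborted via notification) is the entry that says how the abort arrived. Below is a minimal sketch, using only JDK classes, of how such a chain is built and how its innermost cause can be extracted. The class name CauseChain, the helper rootCause, and the RuntimeException standing in for Flink's CheckpointException are illustrative assumptions, not Flink API.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class CauseChain {

    /** Follows getCause() links and returns the innermost throwable. */
    static Throwable rootCause(Throwable t) {
        Throwable root = t;
        // Bound the walk so a malformed, cyclic cause chain cannot loop forever.
        for (int depth = 0; depth < 64; depth++) {
            Throwable next = root.getCause();
            if (next == null || next == root) {
                break;
            }
            root = next;
        }
        return root;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for Flink's CheckpointException, wrapping the
        // same innermost cause that appears in the quoted trace.
        RuntimeException aborted = new RuntimeException(
                "The checkpoint was aborted due to exception of other subtasks"
                        + " sharing the ChannelState file.",
                new CancellationException("checkpoint aborted via notification"));

        CompletableFuture<Void> snapshot = new CompletableFuture<>();
        snapshot.completeExceptionally(aborted);

        try {
            snapshot.get();
        } catch (ExecutionException e) {
            // get() wraps the failure in an ExecutionException, which is why
            // the trace starts with one.
            System.out.println(rootCause(e).getMessage());
        }
    }
}
```

In this sketch the printed message is the one from the CancellationException, which matches the last "Caused by" entry in the trace.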
