We have also enabled unaligned checkpoints. Could it be because of that? We were experience slowness and intermittent packet loss when this issue occurred.
On Wed, Jul 31, 2024 at 7:43 PM Dhruv Patel <dhruv...@gmail.com> wrote: > Hi Everyone, > > We are observing an interesting issue with continuous checkpoint > failures in our job causing the event to not be forwarded through the > pipeline. We saw a spam of the below log in all our task manager instances. > Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: The > checkpoint was aborted due to exception of other subtasks sharing the > ChannelState file. > > Can you help us here with what can be the issue? The issue does not get > fixed until we restart the jobs. > > > Our setup is something like this. > > Flink Version: 1.18.1 (Session Mode) > We use RocksDB without incremental checkpoint > State size: ~ 2GB > No memory issues in TMs > > Checkpointing frequency: Every 1 minute > > > > 2024-07-30 12:16:49.043 [AsyncOperations-thread-83] INFO > o.a.flink.streaming.runtime.tasks.AsyncCheckpointRunnable - KeyedProcess > -> GroupRelationBuilderStream (23/60)#0 - asynchronous part of checkpoint > 29749 could not be completed. > > > > java.util.concurrent.ExecutionException: > org.apache.flink.runtime.checkpoint.CheckpointException: The checkpoint was > aborted due to exception of other subtasks sharing the ChannelState file. > > > > at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > > > > at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) > > > > at > org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:66) > > > > at > org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191) > > > > > at > org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124) > > > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > > > at java.lang.Thread.run(Thread.java:750) > > > > Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: The > checkpoint was aborted due to exception of other subtasks sharing the > ChannelState file. > > > > > at > org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.fail(ChannelStateCheckpointWriter.java:298) > > > > > at > org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.failAndClearWriter(ChannelStateWriteRequestDispatcherImpl.java:212) > > > > > at > org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.handleCheckpointAbortRequest(ChannelStateWriteRequestDispatcherImpl.java:189) > > > > > at > org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatchInternal(ChannelStateWriteRequestDispatcherImpl.java:129) > > > > > at > org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatch(ChannelStateWriteRequestDispatcherImpl.java:94) > > > > > at > org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.loop(ChannelStateWriteRequestExecutorImpl.java:172) > > > > > at > org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.run(ChannelStateWriteRequestExecutorImpl.java:127) > > > > ... 1 common frames omitted > > > > Caused by: java.util.concurrent.CancellationException: checkpoint aborted > via notification > > > > at > org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpoint(SubtaskCheckpointCoordinatorImpl.java:456) > > > > at > org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpointAborted(SubtaskCheckpointCoordinatorImpl.java:410) > > > > at > org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointAbortAsync$18(StreamTask.java:1406) > > > > at > org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointOperation$20(StreamTask.java:1429) > > > > at > org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) > > > > > at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) > > > > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:398) > > > > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:367) > > > > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:352) > > > > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229) > > > > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:858) > > > > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:807) > > > > at > org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:953) > > > > at > org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:932) > > > > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:746) > > > > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) > > > > ... 1 common frames omitted >