Hi Everyone,

    We are observing an interesting issue with continuous checkpoint
failures in our job causing the event to not be forwarded through the
pipeline. We saw a spam of the below log in all our task manager instances.
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: The
checkpoint was aborted due to exception of other subtasks sharing the
ChannelState file.

Can you help us here with what can be the issue? The issue does not get
fixed until we restart the jobs.


Our setup is something like this.

Flink Version: 1.18.1 (Session Mode)
We use RocksDB without incremental checkpoint
State size: ~ 2GB
No memory issues in TMs

Checkpointing frequency: Every 1 minute



2024-07-30 12:16:49.043 [AsyncOperations-thread-83] INFO
o.a.flink.streaming.runtime.tasks.AsyncCheckpointRunnable  - KeyedProcess
-> GroupRelationBuilderStream (23/60)#0 - asynchronous part of checkpoint
29749 could not be completed.



java.util.concurrent.ExecutionException:
org.apache.flink.runtime.checkpoint.CheckpointException: The checkpoint was
aborted due to exception of other subtasks sharing the ChannelState file.



         at
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)



         at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)



         at
org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:66)



         at
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)




at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)



         at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)



         at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)



         at java.lang.Thread.run(Thread.java:750)



Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: The
checkpoint was aborted due to exception of other subtasks sharing the
ChannelState file.




at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.fail(ChannelStateCheckpointWriter.java:298)




at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.failAndClearWriter(ChannelStateWriteRequestDispatcherImpl.java:212)




at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.handleCheckpointAbortRequest(ChannelStateWriteRequestDispatcherImpl.java:189)




at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatchInternal(ChannelStateWriteRequestDispatcherImpl.java:129)




at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatch(ChannelStateWriteRequestDispatcherImpl.java:94)




at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.loop(ChannelStateWriteRequestExecutorImpl.java:172)




at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.run(ChannelStateWriteRequestExecutorImpl.java:127)



         ... 1 common frames omitted



Caused by: java.util.concurrent.CancellationException: checkpoint aborted
via notification



         at
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpoint(SubtaskCheckpointCoordinatorImpl.java:456)



         at
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpointAborted(SubtaskCheckpointCoordinatorImpl.java:410)



         at
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointAbortAsync$18(StreamTask.java:1406)



         at
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointOperation$20(StreamTask.java:1429)



         at
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)




at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)



         at
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:398)



         at
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:367)



         at
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:352)



         at
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)



         at
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:858)



         at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:807)



         at
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:953)



         at
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:932)



         at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:746)



         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)



         ... 1 common frames omitted

Reply via email to