Hi Xiangyu/Dev Team, Thanks for reply.
In our flink job, we increase the *checkpoint timeout to 30 min.* And the *checkpoint interval is 10 min.* But while running the job we got below exception. java.lang.RuntimeException: Error while confirming checkpoint at org.apache.flink.streaming.runtime.tasks.StreamTask .notifyCheckpointComplete(StreamTask.java:952) at org.apache.flink.streaming.runtime.tasks.StreamTask .lambda$notifyCheckpointCompleteAsync$7(StreamTask.java:924) at org.apache.flink.util.function.FunctionUtils.lambda$asCallable$5( FunctionUtils.java:125) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at org.apache.flink.streaming.runtime.tasks. StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run( StreamTaskActionExecutor.java:87) at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java: 78) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor .processMail(MailboxProcessor.java:261) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor .runMailboxLoop(MailboxProcessor.java:186) at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop( StreamTask.java:487) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask .java:470) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Cannot clean commit: Staging file does not exist. at org.apache.flink.runtime.fs.hdfs. HadoopRecoverableFsDataOutputStream$HadoopFsCommitter.commit( HadoopRecoverableFsDataOutputStream.java:250) at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket .onSuccessfulCompletionOfCheckpoint(Bucket.java:300) at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets .commitUpToCheckpoint(Buckets.java:216) at org.apache.flink.streaming.api.functions.sink.filesystem. StreamingFileSink.notifyCheckpointComplete(StreamingFileSink.java:415) at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator .notifyCheckpointComplete(AbstractUdfStreamOperator.java:130) at org.apache.flink.streaming.runtime.tasks.StreamTask .lambda$notifyCheckpointComplete$8(StreamTask.java:936) at org.apache.flink.streaming.runtime.tasks. StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call( StreamTaskActionExecutor.java:101) at org.apache.flink.streaming.runtime.tasks.StreamTask .notifyCheckpointComplete(StreamTask.java:930) ... 12 more It would be great, if you have any workaround for that. Regards, Nagireddy Y. On Thu, Aug 3, 2023 at 7:24 AM xiangyu feng <xiangyu...@gmail.com> wrote: > Hi ynagireddy4u, > > We have met this exception before. Usually it is caused by following > reasons: > > 1), TaskManager is too busy with other works to send the heartbeat to > JobMaster or TaskManager process might already exited; > 2), There might be a network issues between this TaskManager and JobMaster; > 3), In certain cases, JobMaster actor might also being too busy to process > the RPC requests from TaskManager; > > Pls check if your problem fits the above situations. > > Best, > Xiangyu > > > Y SREEKARA BHARGAVA REDDY <ynagiredd...@gmail.com> 于2023年7月31日周一 20:49写道: > >> Hi Team, >> >> Did any one face the below exception. >> If yes, please share the resolution. >> >> >> 2023-07-28 22:04:16 >> j*ava.util.concurrent.TimeoutException: Heartbeat of TaskManager with id >> container_e19_1690528962823_0382_01_000005 timed out.* >> at org.apache.flink.runtime.jobmaster. >> JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster >> .java:1147) >> at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run( >> HeartbeatMonitorImpl.java:109) >> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java: >> 511) >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync( >> AkkaRpcActor.java:397) >> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage( >> AkkaRpcActor.java:190) >> at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor >> .handleRpcMessage(FencedAkkaRpcActor.java:74) >> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage( >> AkkaRpcActor.java:152) >> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) >> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) >> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >> at akka.japi.pf >> .UnitCaseStatement.applyOrElse(CaseStatements.scala:21) >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >> at akka.actor.Actor$class.aroundReceive(Actor.scala:517) >> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) >> at akka.actor.ActorCell.invoke(ActorCell.scala:561) >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) >> at akka.dispatch.Mailbox.run(Mailbox.scala:225) >> at akka.dispatch.Mailbox.exec(Mailbox.scala:235) >> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool >> .java:1339) >> at >> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> at >> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread >> .java:107) >> >> Any suggestions, please share with me. >> >> Regards, >> Nagireddy Y >> >