Re: Heartbeat of TaskManager with id container

Y SREEKARA BHARGAVA REDDY Thu, 03 Aug 2023 21:23:11 -0700

Hi Xiangyu/Dev Team,

Thanks for reply.


In  our flink job, we increase the *checkpoint timeout to 30 min.*
And the *checkpoint interval is 10 min.*

But while running the job we got below exception.

java.lang.RuntimeException: Error while confirming checkpoint
    at org.apache.flink.streaming.runtime.tasks.StreamTask
.notifyCheckpointComplete(StreamTask.java:952)
    at org.apache.flink.streaming.runtime.tasks.StreamTask
.lambda$notifyCheckpointCompleteAsync$7(StreamTask.java:924)
    at org.apache.flink.util.function.FunctionUtils.lambda$asCallable$5(
FunctionUtils.java:125)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.streaming.runtime.tasks.
StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(
StreamTaskActionExecutor.java:87)
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:
78)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor
.processMail(MailboxProcessor.java:261)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor
.runMailboxLoop(MailboxProcessor.java:186)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(
StreamTask.java:487)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask
.java:470)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Cannot clean commit: Staging file does not
exist.
    at org.apache.flink.runtime.fs.hdfs.
HadoopRecoverableFsDataOutputStream$HadoopFsCommitter.commit(
HadoopRecoverableFsDataOutputStream.java:250)
    at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket
.onSuccessfulCompletionOfCheckpoint(Bucket.java:300)
    at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets
.commitUpToCheckpoint(Buckets.java:216)
    at org.apache.flink.streaming.api.functions.sink.filesystem.
StreamingFileSink.notifyCheckpointComplete(StreamingFileSink.java:415)
    at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator
.notifyCheckpointComplete(AbstractUdfStreamOperator.java:130)
    at org.apache.flink.streaming.runtime.tasks.StreamTask
.lambda$notifyCheckpointComplete$8(StreamTask.java:936)
    at org.apache.flink.streaming.runtime.tasks.
StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.call(
StreamTaskActionExecutor.java:101)
    at org.apache.flink.streaming.runtime.tasks.StreamTask
.notifyCheckpointComplete(StreamTask.java:930)
    ... 12 more

It would be great, if you have any workaround for that.

Regards,
Nagireddy Y.






On Thu, Aug 3, 2023 at 7:24 AM xiangyu feng <xiangyu...@gmail.com> wrote:

> Hi ynagireddy4u,
>
> We have met this exception before. Usually it is caused by following
> reasons:
>
> 1), TaskManager is too busy with other works to send the heartbeat to
> JobMaster or TaskManager process might already exited;
> 2), There might be a network issues between this TaskManager and JobMaster;
> 3), In certain cases, JobMaster actor might also being too busy to process
> the RPC requests from TaskManager;
>
> Pls check if your problem fits the above situations.
>
> Best,
> Xiangyu
>
>
> Y SREEKARA BHARGAVA REDDY <ynagiredd...@gmail.com> 于2023年7月31日周一 20:49写道：
>
>> Hi Team,
>>
>> Did any one face the below exception.
>> If yes, please share the resolution.
>>
>>
>> 2023-07-28 22:04:16
>> j*ava.util.concurrent.TimeoutException: Heartbeat of TaskManager with id
>> container_e19_1690528962823_0382_01_000005 timed out.*
>>     at org.apache.flink.runtime.jobmaster.
>> JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster
>> .java:1147)
>>     at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(
>> HeartbeatMonitorImpl.java:109)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:
>> 511)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(
>> AkkaRpcActor.java:397)
>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(
>> AkkaRpcActor.java:190)
>>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor
>> .handleRpcMessage(FencedAkkaRpcActor.java:74)
>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(
>> AkkaRpcActor.java:152)
>>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>     at akka.japi.pf
>> .UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>>     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>>     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool
>> .java:1339)
>>     at
>> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>     at
>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread
>> .java:107)
>>
>> Any suggestions, please share with me.
>>
>> Regards,
>> Nagireddy Y
>>
>

Re: Heartbeat of TaskManager with id container

Reply via email to