Hi Zhijiang, Minglei, all, both your hints and explanations worked well. Thank you very much!
Thanks,
Steffen

> On 23. Jul 2018, at 08:10, Zhijiang(wangzhijiang999) <wangzhijiang...@aliyun.com> wrote:
>
> Hi Steffen,
>
> This exception indicates that when the downstream task requests a partition
> from the upstream task, the upstream task has not yet initialized and
> registered its result partition.
> In this case, the downstream task will query the state from the JobManager
> and then retry requesting the partition from the upstream task until the
> maximum retry timeout is reached. You can increase the parameter
> "taskmanager.network.request-backoff.max" to check whether that helps; the
> default value is 10 s.
>
> BTW, you should check why the upstream task registers its result partition
> late: maybe the upstream TaskManager received the task deployment late from
> the JobManager, or some operation in the upstream task's initialization
> unexpectedly took more time before the result partition was registered.
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> From: Steffen Wohlers <steffenwohl...@gmx.de>
> Sent: Sunday, July 22, 2018, 22:22
> To: user <user@flink.apache.org>
> Subject: Network PartitionNotFoundException when run on multi nodes
>
> Hi all,
>
> I have some problems when running my application on more than one Task
> Manager.
>
> Setup:
> node1: Job Manager, Task Manager
> node2: Task Manager
>
> I can run my program successfully on each node alone when I stop the other
> Task Manager. But when I start both and set parallelism = 2, I get the
> following exception every time (after 30 seconds):
>
> org.apache.flink.runtime.io.network.partition.PartitionNotFoundException:
> Partition 6372b5f434d55e987ea179d6f6b488fe@e389ca50a2c2cf776b90268f987a6546
> not found.
>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:273)
>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:182)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:400)
>     at org.apache.flink.runtime.taskmanager.Task.onPartitionStateUpdate(Task.java:1294)
>     at org.apache.flink.runtime.taskmanager.Task.lambda$triggerPartitionProducerStateCheck$0(Task.java:1151)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> It seems the problem occurs when a subtask is linked to both Task Managers.
>
> Does anybody know how I can make it work?
>
> Thanks,
> Steffen
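For anyone hitting the same error: the backoff setting Zhijiang mentions is adjusted in `flink-conf.yaml`. Below is a minimal sketch; the value of 30000 ms is only an example, and whether raising it helps depends on why the upstream task is slow to register its result partition.

```yaml
# Maximum backoff (in ms) for a downstream task's repeated partition
# requests before it fails with PartitionNotFoundException.
# The default is 10000 (10 s); a larger value gives the upstream task
# more time to register its result partition.
taskmanager.network.request-backoff.max: 30000

# Initial backoff (in ms) between partition requests (default: 100).
taskmanager.network.request-backoff.initial: 100
```

The TaskManagers need a restart for the change to take effect.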