Hey Gyula!

I'm including Piotr and Nico (cc'd), who have worked on the network
stack in the last few releases.

Registering the network structures, including the intermediate results,
actually happens **before** any state is restored, so I'm not sure why
you see this reproducibly only when restoring state. @Nico, Piotr: any
ideas here?

In general I think what happens here is the following:
- a task requests the result of a local upstream producer, but that
one has not registered its intermediate result yet
- this should result in a retry of the request with some backoff
(controlled via the config params you mention:
taskmanager.network.request-backoff.initial and
taskmanager.network.request-backoff.max)
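
To make the retry behaviour concrete, here is a rough sketch of how such
a backoff schedule could play out. This is not Flink's code: the class
and method names are hypothetical, and it assumes the wait starts at the
initial value, doubles on each retry, is capped at the max, and that the
request fails once the max has been used. The values 100 ms and 10000 ms
are what I believe the defaults to be, so treat them as assumptions too.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the Flink code base.
class BackoffSketch {

    // Returns the waits (in ms) before each retry, assuming the backoff
    // doubles each time and is capped at maxMs. After the wait of maxMs
    // has been used, no further retry is scheduled.
    static List<Long> schedule(long initialMs, long maxMs) {
        if (initialMs <= 0 || maxMs <= 0) {
            throw new IllegalArgumentException("backoff values must be positive");
        }
        List<Long> waits = new ArrayList<>();
        long current = 0;
        while (true) {
            if (current == 0) {
                current = initialMs;                      // first retry
            } else if (current < maxMs) {
                current = Math.min(current * 2, maxMs);   // double, but cap
            } else {
                break;                                    // max reached: give up
            }
            waits.add(current);
        }
        return waits;
    }

    // Sums all waits, i.e. how long a consumer keeps retrying in total.
    static long totalWait(List<Long> waits) {
        long total = 0;
        for (long w : waits) {
            total += w;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Long> waits = schedule(100, 10000);
        // prints: [100, 200, 400, 800, 1600, 3200, 6400, 10000] total=22700
        System.out.println(waits + " total=" + totalWait(waits));
    }
}
```

Under these assumptions the consumer would wait about 22.7 seconds in
total before failing with "partition not found", so raising only the max
value extends the window but does not help if the producer never
registers its result within it.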

As a first step I would set logging to DEBUG and check the TM logs for
messages like "Retriggering partition request {}:{}."

You can also check the SingleInputGate code which has the logic for
retriggering requests.

– Ufuk


On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
> Hi Ufuk,
>
> Do you have any quick idea what could cause this problem in Flink 1.4.2?
> It seems like one operator takes too long to deploy and downstream tasks
> error out with "partition not found". This only seems to happen when the
> job is restored from state, and in fact that operator has some keyed and
> operator state as well.
>
> Deploying the same job from empty state works well. We tried increasing
> taskmanager.network.request-backoff.max, but that didn't help.
>
> It would be great if you have some pointers on where to look further; I
> haven't seen this happening before.
>
> Thank you!
> Gyula
>
> The error:
> org.apache.flink.runtime.io.network.partition.: Partition 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd not found.
>     at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
>     at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
>     at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
>     at java.util.TimerThread.mainLoop(Timer.java:555)
>     at java.util.TimerThread.run(Timer.java:505)



-- 
Data Artisans GmbH | Stresemannstr. 121a | 10963 Berlin

i...@data-artisans.com
+49-30-43208879

Registered at Amtsgericht Charlottenburg - HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
