Hey Gyula! I'm including Piotr and Nico (cc'd) who have worked on the network stack in the last releases.
Registering the network structures, including the intermediate results, actually happens **before** any state is restored, so I'm not sure why this reproducibly happens when you restore state. @Nico, Piotr: any ideas here?

In general, I think what happens here is the following:

- a task requests the result of a local upstream producer, but that producer has not registered its intermediate result yet
- this should result in a retry of the request with some backoff (controlled via the config params you mention: taskmanager.network.request-backoff.max and taskmanager.network.request-backoff.initial)

As a first step, I would set logging to DEBUG and check the TM logs for messages like "Retriggering partition request {}:{}." You can also check the SingleInputGate code, which has the logic for retriggering requests.

– Ufuk

On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi Ufuk,
>
> Do you have any quick idea what could cause this problem in Flink 1.4.2?
> It seems like one operator takes too long to deploy and downstream tasks error
> out on partition not found. This only seems to happen when the job is
> restored from state, and in fact that operator has some keyed and operator
> state as well.
>
> Deploying the same job from empty state works well. We tried increasing
> taskmanager.network.request-backoff.max, but that didn't help.
>
> It would be great if you have some pointers on where to look further; I haven't
> seen this happening before.
>
> Thank you!
> Gyula
>
> The error:
> org.apache.flink.runtime.io.network.partition.: Partition
> 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd not found.
>     at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
>     at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
>     at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
>     at java.util.TimerThread.mainLoop(Timer.java:555)
>     at java.util.TimerThread.run(Timer.java:505)

--
Data Artisans GmbH | Stresemannstr. 121a | 10963 Berlin

i...@data-artisans.com
+49-30-43208879

Registered at Amtsgericht Charlottenburg - HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
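
To make the retry-with-backoff idea from the reply above more concrete, here is a minimal sketch of how the two settings could bound the total time a consumer keeps retrying. It assumes a simple policy: start at the initial delay, double on every retry, cap at the maximum, and give up after one attempt at the cap. That policy, the class `BackoffSketch`, its method names, and the values 100 ms and 10000 ms are illustrative assumptions, not code or defaults taken from Flink; the real logic is in the SingleInputGate and input channel classes and may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Flink. */
public class BackoffSketch {

    /**
     * Returns the delays (in ms) between successive partition requests under
     * the ASSUMED policy: start at initialMs, double each time, cap at maxMs,
     * and stop once a delay equal to the cap has been used.
     */
    public static List<Integer> backoffSchedule(int initialMs, int maxMs) {
        if (initialMs <= 0 || maxMs < initialMs) {
            throw new IllegalArgumentException("need 0 < initialMs <= maxMs");
        }
        List<Integer> delays = new ArrayList<>();
        int current = initialMs;
        while (true) {
            delays.add(current);
            if (current >= maxMs) {
                break; // cap reached: no further retries in this sketch
            }
            // Use long arithmetic so doubling cannot overflow before capping.
            current = (int) Math.min((long) current * 2, maxMs);
        }
        return delays;
    }

    /** Sums the delays: the longest a consumer would wait before giving up. */
    public static long totalWaitMs(List<Integer> delays) {
        long total = 0;
        for (int delay : delays) {
            total += delay;
        }
        return total;
    }

    public static void main(String[] args) {
        // Example values chosen for illustration.
        List<Integer> delays = backoffSchedule(100, 10_000);
        System.out.println(delays);
        System.out.println(totalWaitMs(delays) + " ms");
    }
}
```

With these example values the schedule is [100, 200, 400, 800, 1600, 3200, 6400, 10000], about 22.7 seconds in total. Under this assumed policy, a producer that needs longer than that window to register its result would still cause "partition not found", and raising the maximum widens the window only if the new value is what the task managers actually pick up.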