Hi! Can it be that some hostname / IP address mapping / etc gets thrown off somewhere in the process?
This exception looks like the following happens: - JobManager gets a message from a TaskManager that a partition is ready, notifies other TaskManagers - TaskManager gets the update message, connects to the address of the indicated TaskManager - That taskmanager does not have that partition Is it possible that JobManager / TaskManager see different names / addresses? Also, is that Flink 1.2, DataSet job? Stephan On Wed, May 10, 2017 at 7:05 PM, David Brelloch <brell...@gmail.com> wrote: > Hi everyone, > > We are attempting to run flink 1.2 in a distributed dockerized environment > and are running into issues when running jobs in parallel. > > The exception we are getting fairly quickly after start up is: > > org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: > Partition d3d8404aa26bedafd77e88bdfd88375b@84037703da6706cd1017f53fd8b818cd > not found. > at > org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:204) > at > org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:129) > at > org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:331) > at > org.apache.flink.runtime.taskmanager.Task.onPartitionStateUpdate(Task.java:1244) > at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1082) > at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1077) > at > org.apache.flink.runtime.concurrent.impl.FlinkFuture$5.onComplete(FlinkFuture.java:259) > at akka.dispatch.OnComplete.internal(Future.scala:248) > at akka.dispatch.OnComplete.internal(Future.scala:245) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) > at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91) > at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) > at > akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) > at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) > at > akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90) > at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > > This only occurs when running in parallel but I don't have a lot to go on > from the exception. We have configured the following ports: > jobmanager.rpc.port: 6123 > taskmanager.rpc.port: 6122 > taskmanager.data.port: 6121 > > And have mapped the docker ports 6121 and 6122 on the task managers as > well as 6123 on the job manager. > > Does anyone have any suggestions for other places to look or settings to > try? > > Thanks, > David > > > >