Stephan,

Thanks for pointing us in the right direction on the different addresses. That was the issue.

David

On Wed, May 10, 2017 at 3:03 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi!
>
> Can it be that some hostname / IP address mapping / etc gets thrown off
> somewhere in the process?
>
> This exception looks like the following happens:
>
>   - JobManager gets a message from a TaskManager that a partition is
>     ready, notifies other TaskManagers
>   - TaskManager gets the update message, connects to the address of the
>     indicated TaskManager
>   - That taskmanager does not have that partition
>
> Is it possible that JobManager / TaskManager see different names /
> addresses?
>
> Also, is that Flink 1.2, DataSet job?
>
> Stephan
>
>
> On Wed, May 10, 2017 at 7:05 PM, David Brelloch <brell...@gmail.com>
> wrote:
>
>> Hi everyone,
>>
>> We are attempting to run flink 1.2 in a distributed dockerized
>> environment and are running into issues when running jobs in parallel.
>>
>> The exception we are getting fairly quickly after start up is:
>>
>> org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition d3d8404aa26bedafd77e88bdfd88375b@84037703da6706cd1017f53fd8b818cd not found.
>>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:204)
>>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:129)
>>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:331)
>>     at org.apache.flink.runtime.taskmanager.Task.onPartitionStateUpdate(Task.java:1244)
>>     at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1082)
>>     at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1077)
>>     at org.apache.flink.runtime.concurrent.impl.FlinkFuture$5.onComplete(FlinkFuture.java:259)
>>     at akka.dispatch.OnComplete.internal(Future.scala:248)
>>     at akka.dispatch.OnComplete.internal(Future.scala:245)
>>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)
>>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)
>>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>>     at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
>>     at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
>>     at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>>     at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
>>     at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>>     at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
>>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> This only occurs when running in parallel but I don't have a lot to go on
>> from the exception. We have configured the following ports:
>>
>> jobmanager.rpc.port: 6123
>> taskmanager.rpc.port: 6122
>> taskmanager.data.port: 6121
>>
>> And have mapped the docker ports 6121 and 6122 on the task managers as
>> well as 6123 on the job manager.
>>
>> Does anyone have any suggestions for other places to look or settings to
>> try?
>>
>> Thanks,
>> David
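
For anyone who finds this thread with the same PartitionNotFoundException: the sequence Stephan describes can break when the address a TaskManager is known by is resolved to something different on its peers. The following is a minimal Python sketch of a check for that kind of mismatch, meant to be run from inside each container. It is not part of Flink. The function names `resolve` and `find_mismatches`, the host names `tm-1` and `tm-2`, and all IP addresses are hypothetical and only illustrate the idea.

```python
import socket


def resolve(hostname):
    """Return the set of IP addresses that hostname resolves to on this machine.

    An empty set means the name could not be resolved here at all.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def find_mismatches(expected, resolver=resolve):
    """Compare expected addresses with what this machine actually resolves.

    expected maps a hostname (hypothetical, e.g. a TaskManager's name) to the
    IP address that host is supposed to be reachable at. The result maps each
    hostname that does not resolve to its expected IP to the sorted list of
    addresses that were resolved instead.
    """
    mismatches = {}
    for hostname, expected_ip in expected.items():
        seen = resolver(hostname)
        if expected_ip not in seen:
            mismatches[hostname] = sorted(seen)
    return mismatches


if __name__ == "__main__":
    # Hypothetical example: a fake resolver stands in for DNS so the sketch
    # runs without a network. tm-2 resolves to a container-internal address
    # instead of the address its peers are expected to use.
    fake_dns = {"tm-1": {"10.0.0.5"}, "tm-2": {"172.17.0.3"}}
    expected = {"tm-1": "10.0.0.5", "tm-2": "10.0.0.6"}
    print(find_mismatches(expected, resolver=lambda h: fake_dns.get(h, set())))
```

Calling `find_mismatches(expected)` without the `resolver` argument uses the real resolver of the machine it runs on, so comparing the output from the JobManager container with the output from each TaskManager container shows whether they see different names or addresses.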