Hi, I'm having some trouble running a Java-based Flink job in a yarn-session.
The job itself reads a set of files into a DataStream (I use DataStream because in the future I intend to replace the file input with a Kafka feed), does some parsing, and eventually writes the data into HBase. Most of the time this works fine, yet sometimes it fails with this exception:

org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition 794b5ce385c296b7943fa4c1f072d6b9@13aa7ef02a5d9e0898204ec8ce283363 not found.
	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:203)
	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:128)
	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:345)
	at org.apache.flink.runtime.taskmanager.Task.onPartitionStateUpdate(Task.java:1286)
	at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1123)
	at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1118)
	at org.apache.flink.runtime.concurrent.impl.FlinkFuture$5.onComplete(FlinkFuture.java:272)
	at akka.dispatch.OnComplete.internal(Future.scala:248)
	at akka.dispatch.OnComplete.internal(Future.scala:245)
	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)
	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I went through all logs on the Hadoop side for all the related containers and, other than this exception, I did not see any warning/error that might explain what is going on. The "most of the time this works fine" part makes this hard to troubleshoot: when I run the same job again, it may run perfectly that time.

I'm using flink-1.3.2-bin-hadoop27-scala_2.11.tgz, and I double-checked my pom.xml: it uses the same Flink and Scala versions.
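For reference, the job is roughly shaped like the sketch below. This is a simplified stand-in rather than the actual code: the input path, the HBase table/column names, and the parsing step are all placeholders.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FileToHBaseJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read the input files; later this source will be swapped for a Kafka consumer.
        DataStream<String> lines = env.readTextFile("hdfs:///data/input/");

        lines
            .map(line -> line.trim())  // placeholder for the real parsing logic
            .addSink(new HBaseSink());

        env.execute("File to HBase");
    }

    // Simple sink that writes each record into HBase as a Put.
    public static class HBaseSink extends RichSinkFunction<String> {
        private transient Connection connection;
        private transient Table table;

        @Override
        public void open(Configuration parameters) throws Exception {
            connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
            table = connection.getTable(TableName.valueOf("mytable")); // placeholder table name
        }

        @Override
        public void invoke(String value) throws Exception {
            Put put = new Put(Bytes.toBytes(value));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(value));
            table.put(put);
        }

        @Override
        public void close() throws Exception {
            if (table != null) table.close();
            if (connection != null) connection.close();
        }
    }
}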
The command used to start the yarn-session on my experimental cluster (no security, no other users):

/usr/local/flink-1.3.2/bin/yarn-session.sh \
    --container 180 \
    --name "Flink on Yarn Experiments" \
    --slots 1 \
    --jobManagerMemory 4000 \
    --taskManagerMemory 4000 \
    --streaming \
    --detached

Two relevant fragments from my application pom.xml:

<flink.version>1.3.2</flink.version>
<flink.scala.version>2.11</flink.scala.version>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_${flink.scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_${flink.scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-hbase_${flink.scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

I could really use some suggestions where to look for the root cause of this. Is this something in my application? My Hadoop cluster? Or is this a problem in Flink 1.3.2?

Thanks.

--
Best regards / Met vriendelijke groeten,

Niels Basjes