Hi, I'm having some trouble running a Java-based Flink job in a yarn-session.
The job itself reads a set of files into a DataStream (I use DataStream because in the future I intend to replace the file input with a Kafka feed), does some parsing, and eventually writes the data into HBase. Most of the time this works fine, yet sometimes it fails with this exception:

org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition 794b5ce385c296b7943fa4c1f072d6b9@13aa7ef02a5d9e0898204ec8ce283363 not found.
	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:203)
	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:128)
	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:345)
	at org.apache.flink.runtime.taskmanager.Task.onPartitionStateUpdate(Task.java:1286)
	at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1123)
	at org.apache.flink.runtime.taskmanager.Task$2.apply(Task.java:1118)
	at org.apache.flink.runtime.concurrent.impl.FlinkFuture$5.onComplete(FlinkFuture.java:272)
	at akka.dispatch.OnComplete.internal(Future.scala:248)
	at akka.dispatch.OnComplete.internal(Future.scala:245)
	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)
	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
	at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I went through all logs on the Hadoop side for all the related containers and, other than this exception, I did not see any warning/error that might explain what is going on. The "most of the time this works fine" part makes this hard to troubleshoot: when I run the same job again, it may run perfectly that time.

I'm using flink-1.3.2-bin-hadoop27-scala_2.11.tgz, and I double-checked my pom.xml: it uses the same Flink and Scala versions.
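For reference, the job is roughly shaped like the sketch below. This is a simplified stand-in rather than the actual code: the input path, the HBase table/column names, and the parsing step are all placeholders.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FileToHBaseJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read the input files; later this source will be swapped for a Kafka consumer.
        DataStream<String> lines = env.readTextFile("hdfs:///data/input/");

        lines
            .map(line -> line.trim())  // placeholder for the real parsing logic
            .addSink(new HBaseSink());

        env.execute("File to HBase");
    }

    // Simple sink that writes each record into HBase as a Put.
    public static class HBaseSink extends RichSinkFunction<String> {
        private transient Connection connection;
        private transient Table table;

        @Override
        public void open(Configuration parameters) throws Exception {
            connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
            table = connection.getTable(TableName.valueOf("mytable")); // placeholder table name
        }

        @Override
        public void invoke(String value) throws Exception {
            Put put = new Put(Bytes.toBytes(value));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(value));
            table.put(put);
        }

        @Override
        public void close() throws Exception {
            if (table != null) table.close();
            if (connection != null) connection.close();
        }
    }
}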
The command used to start the yarn-session on my experimental cluster (no security, no other users):

/usr/local/flink-1.3.2/bin/yarn-session.sh \
    --container 180 \
    --name "Flink on Yarn Experiments" \
    --slots 1 \
    --jobManagerMemory 4000 \
    --taskManagerMemory 4000 \
    --streaming \
    --detached

Two relevant fragments from my application pom.xml:

<flink.version>1.3.2</flink.version>
<flink.scala.version>2.11</flink.scala.version>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_${flink.scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_${flink.scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-hbase_${flink.scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

I could really use some suggestions where to look for the root cause of this. Is this something in my application? My Hadoop cluster? Or is this a problem in Flink 1.3.2?

Thanks.

--
Best regards / Met vriendelijke groeten,

Niels Basjes