Re: failures during job start

Chesnay Schepler Thu, 19 Aug 2021 11:23:21 -0700

This exception means that a task was deployed, but the task that produces the data it wants to consume was not available yet (even after waiting for a while).

Your case sounds similar to https://issues.apache.org/jira/browse/FLINK-9413, where this happens because the deployment of the producing task takes too long.


You have 2 options to solve this:

a) Have Flink wait longer for the partition to be created by increasing taskmanager.network.request-backoff.max

b) Speed up the deployment; for this you'd naturally have to investigate why the deployment takes so long in the first place.
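
For option a), a minimal sketch of what the change could look like in flink-conf.yaml. The value below is only an illustrative assumption, not a recommendation; the option is given in milliseconds, and as far as I know the default is 10000, so check the configuration docs for your Flink version before settling on a number.

```yaml
# flink-conf.yaml (on the TaskManagers)
# Maximum backoff, in milliseconds, for partition requests of input channels.
# 20000 is an assumed example value chosen only to illustrate raising the limit.
taskmanager.network.request-backoff.max: 20000
```

In a session cluster the TaskManagers would need to be restarted for the new value to take effect.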


On 19/08/2021 07:15, Colletta, Edward wrote:

Any help with this would be appreciated. Is it possible that this is a data/application issue or a Flink config/resource issue?
Using flink 11.2, java 11, session cluster, 5 nodes 32 cores each node.
I have an issue where starting a job takes a long time, and sometimes fails with PartitionNotFoundException, but succeeds on restart. The job has 10 Kafka sources (10 partitions for each topic) and parallelism 5.
The failure does not happen when the Kafka logs are empty.
Note that during the scenario below, CPU usage on the task managers and job managers is low (below 30%).
The scenario we see:

  * run request to load and run a jar, job appears on dashboard with
    all 160 subtasks in Deploying state
  * after 2 minutes some subtasks start transitioning to running.
  * after another 30 seconds failure occurs and job goes into
    Restarting state
  * after another minute, restart completes, all nodes running.

Exception history shows

2021-08-15 07:55:02
org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition 205a0867c6ef540009acd962d556f981#0@a6b547c5096f3c33eb9059cfe767a2ec not found.
        at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:267)
        at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:166)
        at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:521)
        at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.lambda$triggerPartitionStateCheck$1(SingleInputGate.java:765)
        at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714)
        at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
