I built a new Flink distribution from the release-1.5 branch yesterday. The first time I tried to run a job with it, the job ended up in a stalled state: it didn't manage to (re)start, and what makes it worse is that it didn't exit as failed either.
The next time I tried running the same job (but on a new EMR cluster, everything from scratch) it just worked normally. On the problematic run, the YARN job was started and the Flink UI was being served, but the UI kept showing status CREATED for all sub-tasks and nothing seemed to be happening. I found this in the job manager log first (could be unrelated):

2018-03-28 15:26:17,449 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
    at akka.dispatch.OnComplete.internal(Future.scala:258)
    at akka.dispatch.OnComplete.internal(Future.scala:256)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)

After this there was:

2018-03-28 15:26:17,521 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).
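Regarding the "Slots required: 20, slots allocated: 8" part of that exception, this is my (possibly wrong) understanding of the arithmetic, sketched with placeholder numbers rather than my real -yn/-ys values:

    # Sketch only: NODE_COUNT and SLOT_COUNT below are placeholders, not my actual settings.
    # The YARN session should bring up NODE_COUNT TaskManagers with SLOT_COUNT slots each,
    # and with eager scheduling the job wants all PARALLELISM slots before tasks leave CREATED.
    NODE_COUNT=4
    SLOT_COUNT=5
    PARALLELISM=20
    echo "slots available: $((NODE_COUNT * SLOT_COUNT)), slots required: ${PARALLELISM}"
    # In the failing run only 8 of the 20 required slots showed up within the 300000 ms timeout.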
Some time after that restart message there was:

2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection - GET operation failed
java.io.EOFException: Premature end of GET request
    at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
    at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)

The task manager logs didn't have any errors. Is that BlobServerConnection error severe enough to make the job get stuck like this? Does this look like a Flink bug, at least in that the job just gets stuck and neither restarts nor makes the YARN app exit as failed?

My launch command is basically:

flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster \
    -yn ${NODE_COUNT} -ys ${SLOT_COUNT} \
    -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} \
    -yst \
    -yD restart-strategy=fixed-delay \
    -yD restart-strategy.fixed-delay.attempts=3 \
    -yD "restart-strategy.fixed-delay.delay=30 s" \
    -p ${PARALLELISM} $@

I'm also setting this to fix a classloading error (seen with the previous build, which otherwise still works):

-yD classloader.resolve-order=parent-first

The cluster was AWS EMR, release 5.12.0. Thanks.
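P.S. In case it's relevant, my understanding is that the -yD options above should correspond to these flink-conf.yaml entries (key names as I believe they are in 1.5; happy to be corrected):

    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 3
    restart-strategy.fixed-delay.delay: 30 s
    classloader.resolve-order: parent-first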