Thank you. The YARN job starts now, but the Flink job itself ends up in some bad state.
The Flink UI keeps showing status CREATED for all sub-tasks and nothing seems to be happening.

(For the record, this is what I did: export HADOOP_CLASSPATH=`hadoop classpath` – as found at https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/hadoop.html)

I found this in the job manager log:

2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
    at akka.dispatch.OnComplete.internal(Future.scala:258)
    at akka.dispatch.OnComplete.internal(Future.scala:256)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)

After this there was:

2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).

And some time after that:

2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection - GET operation failed
java.io.EOFException: Premature end of GET request
    at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
    at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)

The task manager logs don't have any errors.

Is that BlobServerConnection error severe enough to make the job get stuck like this? How can I debug this further?

Thanks!

On Wed, Mar 28, 2018 at 5:56 PM, Gary Yao <g...@data-artisans.com> wrote:

> Hi Juho,
>
> Can you try submitting with HADOOP_CLASSPATH=`hadoop classpath` set? [1]
> For example:
> HADOOP_CLASSPATH=`hadoop classpath` flink-${FLINK_VERSION}/bin/flink run [...]
>
> Best,
> Gary
>
> [1] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/hadoop.html#configuring-flink-with-hadoop-classpaths
>
> On Wed, Mar 28, 2018 at 4:26 PM, Juho Autio <juho.au...@rovio.com> wrote:
>
>> I built a new Flink distribution from the release-1.5 branch today.
>>
>> I tried running a job but get this error:
>> java.lang.NoClassDefFoundError: com/sun/jersey/core/util/FeaturesAndProperties
>>
>> I use yarn-cluster mode.
>>
>> The jersey-core jar is found in the hadoop lib on my EMR cluster, but
>> it seems like it's not used any more.
>>
>> I checked that the jersey-core classes are not included in the new
>> distribution, but they were not included in my previously built Flink
>> 1.5-SNAPSHOT either, which works. Has something changed recently to
>> cause this?
>>
>> Is this a Flink bug, or should I fix it by somehow explicitly telling
>> the Flink YARN app to use the hadoop lib now?
>>
>> More details below if needed.
>>
>> Thanks,
>> Juho
>>
>>
>> My launch command is basically:
>>
>> flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT}
>> -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY}
>> -yst -yD restart-strategy=fixed-delay
>> -yD restart-strategy.fixed-delay.attempts=3
>> -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@
>>
>>
>> I'm also setting this to fix a classloading error (with the previous
>> build that still works):
>> -yD classloader.resolve-order=parent-first
>>
>>
>> Error stack trace:
>>
>> java.lang.NoClassDefFoundError: com/sun/jersey/core/util/FeaturesAndProperties
>>     at java.lang.ClassLoader.defineClass1(Native Method)
>>     at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>>     at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>     at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
>>     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
>>     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
>>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.getClusterDescriptor(FlinkYarnSessionCli.java:971)
>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createDescriptor(FlinkYarnSessionCli.java:273)
>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createClusterDescriptor(FlinkYarnSessionCli.java:449)
>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createClusterDescriptor(FlinkYarnSessionCli.java:92)
>>     at org.apache.fli
>> Command exiting with ret '31'
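PS. About the NoResourceAvailableException above: if I understand the slot accounting right, a YARN session gets roughly -yn × -ys task slots in total, and the job apparently needs 20 slots up front, so this may just be arithmetic. A quick sanity check with placeholder values (the real numbers come from ${NODE_COUNT}, ${SLOT_COUNT} and ${PARALLELISM} in my launch script, so these are only a guess):

```shell
#!/bin/sh
# Placeholder values; substitute the actual launch parameters.
NODE_COUNT=4    # -yn: TaskManager containers requested from YARN
SLOT_COUNT=2    # -ys: task slots per TaskManager
PARALLELISM=20  # -p: job parallelism

# Total slots the session can offer the job.
TOTAL_SLOTS=$((NODE_COUNT * SLOT_COUNT))

if [ "$TOTAL_SLOTS" -lt "$PARALLELISM" ]; then
  echo "only $TOTAL_SLOTS slots for parallelism $PARALLELISM: raise -yn/-ys or lower -p"
fi
```

With numbers like these, "Slots required: 20, slots allocated: 8" would be exactly what the exception reports; whether that's really what's happening here depends on the actual values, and it still wouldn't explain the BlobServerConnection error.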