Never mind, I'll post this new problem as a new thread.

On Wed, Mar 28, 2018 at 6:35 PM, Juho Autio <juho.au...@rovio.com> wrote:
> Thank you. The YARN job was started now, but the Flink job itself is in
> some bad state.
>
> Flink UI keeps showing status CREATED for all sub-tasks and nothing seems
> to be happening.
>
> (For the record, this is what I did: export HADOOP_CLASSPATH=`hadoop classpath`
> – as found at
> https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/hadoop.html)
>
> I found this in the job manager log:
>
> 2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>   - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>     at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
>     at akka.dispatch.OnComplete.internal(Future.scala:258)
>     at akka.dispatch.OnComplete.internal(Future.scala:256)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>     at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>     at java.lang.Thread.run(Thread.java:748)
>
> After this there was:
>
> 2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>   - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).
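For reference (my reading, not something confirmed in the thread): with the -yn/-ys flags used in the launch command quoted further below, the slots available in yarn-cluster mode are roughly NODE_COUNT * SLOT_COUNT, so a NoResourceAvailableException like the one above usually means that product is smaller than the requested parallelism (here only 8 of the 20 required slots were allocated). A minimal sketch of a submission where the slot count covers -p, with purely hypothetical values and jar path:

  # Hypothetical values: 5 TaskManagers * 4 slots each = 20 slots, enough for -p 20.
  export HADOOP_CLASSPATH=`hadoop classpath`
  flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster \
    -yn 5 -ys 4 -p 20 \
    path/to/job.jar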
>
> And some time after that:
>
> 2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection
>   - GET operation failed
> java.io.EOFException: Premature end of GET request
>     at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
>     at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)
>
> Task manager logs don't have any errors.
>
> Is that error about BlobServerConnection severe enough to make the job get
> stuck like this? How can I debug this further?
>
> Thanks!
>
> On Wed, Mar 28, 2018 at 5:56 PM, Gary Yao <g...@data-artisans.com> wrote:
>
>> Hi Juho,
>>
>> Can you try submitting with HADOOP_CLASSPATH=`hadoop classpath` set? [1]
>> For example:
>> HADOOP_CLASSPATH=`hadoop classpath` flink-${FLINK_VERSION}/bin/flink run [...]
>>
>> Best,
>> Gary
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/hadoop.html#configuring-flink-with-hadoop-classpaths
>>
>>
>> On Wed, Mar 28, 2018 at 4:26 PM, Juho Autio <juho.au...@rovio.com> wrote:
>>
>>> I built a new Flink distribution from the release-1.5 branch today.
>>>
>>> I tried running a job but get this error:
>>> java.lang.NoClassDefFoundError: com/sun/jersey/core/util/FeaturesAndProperties
>>>
>>> I use yarn-cluster mode.
>>>
>>> The jersey-core jar is found in the hadoop lib on my EMR cluster, but it
>>> seems like it's not used any more.
>>>
>>> I checked that the jersey-core classes are not included in the new
>>> distribution, but they were not included in my previously built Flink
>>> 1.5-SNAPSHOT either, which works. Has something changed recently to
>>> cause this?
>>>
>>> Is this a Flink bug, or should I fix this by somehow explicitly telling
>>> the Flink YARN app to use the hadoop lib now?
>>>
>>> More details below if needed.
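One way to double-check Gary's suggestion (an assumption on my part, not verified in the thread) is to expand what `hadoop classpath` reports and look for the jersey jars before exporting it, e.g.:

  # Expand classpath wildcards and look for the jersey jars; --glob needs a
  # Hadoop version that supports `hadoop classpath --glob`.
  hadoop classpath --glob | tr ':' '\n' | grep -i jersey

  # Then export it before submitting, as suggested above:
  export HADOOP_CLASSPATH=`hadoop classpath`
  flink-${FLINK_VERSION}/bin/flink run [...]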
>>>
>>> Thanks,
>>> Juho
>>>
>>>
>>> My launch command is basically:
>>>
>>> flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT} \
>>>   -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} \
>>>   -yst -yD restart-strategy=fixed-delay \
>>>   -yD restart-strategy.fixed-delay.attempts=3 \
>>>   -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@
>>>
>>>
>>> I'm also setting this to fix some classloading error (with the previous
>>> build that still works):
>>> -yD.classloader.resolve-order=parent-first
>>>
>>>
>>> Error stack trace:
>>>
>>> java.lang.NoClassDefFoundError: com/sun/jersey/core/util/FeaturesAndProperties
>>>     at java.lang.ClassLoader.defineClass1(Native Method)
>>>     at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>>>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>>>     at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>     at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
>>>     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
>>>     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
>>>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.getClusterDescriptor(FlinkYarnSessionCli.java:971)
>>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createDescriptor(FlinkYarnSessionCli.java:273)
>>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createClusterDescriptor(FlinkYarnSessionCli.java:449)
>>>     at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createClusterDescriptor(FlinkYarnSessionCli.java:92)
>>>     at org.apache.fli
>>> Command exiting with ret '31'
>>>
>>
>
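As an aside, if the parent-first setting turns out to be needed for every submission, it could also live in the distribution's configuration instead of being passed as a dynamic property each time; a minimal sketch, assuming the classloader.resolve-order option described in the Flink docs and an illustrative path:

  # Append to the distribution's conf/flink-conf.yaml (path is illustrative):
  echo "classloader.resolve-order: parent-first" >> flink-${FLINK_VERSION}/conf/flink-conf.yaml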