[ https://issues.apache.org/jira/browse/FLINK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419245#comment-16419245 ]
Nico Kruber commented on FLINK-9010:
------------------------------------

Sorry for the late response.
* The distinction between "logical slots" and "physical slots" is confusing, and I somehow doubt that it is documented (but maybe I'm wrong here). We should probably at least adapt the log messages to something that users expect when they read "slots".
* Regarding the actual issue: there should always have been enough machines to offer all needed slots, but I cannot rule out that some EC2 instance was unresponsive or restarting due to a failure. I think [~pnowojski] did some more experiments with big cluster setups recently, and I haven't heard of this occurring again (correct me if I'm wrong). If it did not occur again, we may close this issue as a possible EC2 hiccup.

> NoResourceAvailableException with FLIP-6
> -----------------------------------------
>
>                 Key: FLINK-9010
>                 URL: https://issues.apache.org/jira/browse/FLINK-9010
>             Project: Flink
>          Issue Type: Bug
>          Components: ResourceManager
>    Affects Versions: 1.5.0
>            Reporter: Nico Kruber
>            Assignee: Nico Kruber
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> I was trying to run a bigger program with 400 slots (100 TMs, 2 slots each) with FLIP-6 mode and a checkpointing interval of 1000 and got the following exception:
> {code}
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_1521038088305_0257_01_000101 - Remaining pending container requests: 302
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - TaskExecutor container_1521038088305_0257_01_000101 will be started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory limit 3072 MB
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,155 INFO  org.apache.flink.yarn.Utils  - Copying from file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_000001/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml to hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
> 2018-03-16 03:41:20,165 INFO  org.apache.flink.yarn.YarnResourceManager  - Prepared local resource for modified yaml: resource { scheme: "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: "/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml" } size: 595 timestamp: 1521171680164 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager  - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager  - Starting TaskManagers with command: $JAVA_HOME/bin/java -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m -Dlog.file=<LOG_DIR>/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> <LOG_DIR>/taskmanager.out 2> <LOG_DIR>/taskmanager.err
> 2018-03-16 03:41:20,176 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : ip-172-31-3-221.eu-west-1.compute.internal:8041
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_1521038088305_0257_01_000102 - Remaining pending container requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - TaskExecutor container_1521038088305_0257_01_000102 will be started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils  - Copying from file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_000001/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml to hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
> 2018-03-16 03:41:20,190 INFO  org.apache.flink.yarn.YarnResourceManager  - Prepared local resource for modified yaml: resource { scheme: "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: "/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml" } size: 595 timestamp: 1521171680190 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager  - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager  - Starting TaskManagers with command: $JAVA_HOME/bin/java -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m -Dlog.file=<LOG_DIR>/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> <LOG_DIR>/taskmanager.out 2> <LOG_DIR>/taskmanager.err
> 2018-03-16 03:41:20,203 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : ip-172-31-1-233.eu-west-1.compute.internal:8041
> 2018-03-16 03:41:20,713 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager 5fb7473a7738ef09e2c1fe8c5fc46e1e at the SlotManager.
> 2018-03-16 03:41:20,938 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Checkpoint triggering task Source: Custom Source (1/400) is not being executed at the moment. Aborting checkpoint.
> 2018-03-16 03:41:21,611 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager a078410d60d99351c0f54691c0beb5ed at the SlotManager.
> 2018-03-16 03:41:21,938 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Checkpoint triggering task Source: Custom Source (1/400) is not being executed at the moment. Aborting checkpoint.
> 2018-03-16 03:41:21,972 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager 6980e6ba9ce4945c7b2e0ede5130c7dc at the SlotManager.
> 2018-03-16 03:41:22,938 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Checkpoint triggering task Source: Custom Source (1/400) is not being executed at the moment. Aborting checkpoint.
> 2018-03-16 03:41:23,938 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Checkpoint triggering task Source: Custom Source (1/400) is not being executed at the moment. Aborting checkpoint.
> 2018-03-16 03:41:24,882 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager f7401aa710e890b811de8e415f34a61b at the SlotManager.
> 2018-03-16 03:41:24,883 INFO  org.apache.flink.yarn.YarnResourceManager  - Replacing old instance of worker for ResourceID container_1521038088305_0257_01_000041
> 2018-03-16 03:41:24,883 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Unregister TaskManager f7401aa710e890b811de8e415f34a61b from the SlotManager.
> 2018-03-16 03:41:24,883 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager e89e020bc7ebccf07849e326b08b6b73 at the SlotManager.
> 2018-03-16 03:41:24,884 INFO  org.apache.flink.yarn.YarnResourceManager  - The target with resource ID container_1521038088305_0257_01_000041 is already been monitored.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 301.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 302.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 303.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 304.
> 2018-03-16 03:41:24,937 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 305.
> 2018-03-16 03:41:24,938 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 306.
> 2018-03-16 03:41:24,938 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Checkpoint triggering task Source: Custom Source (1/400) is not being executed at the moment. Aborting checkpoint.
> 2018-03-16 03:41:24,938 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 307.
> 2018-03-16 03:41:24,939 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 308.
> 2018-03-16 03:41:25,255 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager a075d154164bab5500f42a0aad7312ad at the SlotManager.
> ...
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 800, slots allocated: 792
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$2(ExecutionGraph.java:997)
> 	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
> 	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> 	at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:517)
> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> 	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> 	at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:755)
> 	at akka.dispatch.OnComplete.internal(Future.scala:258)
> 	at akka.dispatch.OnComplete.internal(Future.scala:256)
> 	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> 	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 	at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> 	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> 	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> 	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
> 	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
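A side note on the "logical" vs. "physical" slot confusion mentioned above: the exception's "Slots required: 800" presumably counts logical slots (one per subtask), while the cluster only has to offer enough physical TaskManager slots to cover the maximum operator parallelism, since subtasks of different operators can share a physical slot. Here is a small standalone sketch of that arithmetic (plain Java, not Flink's implementation; the two-operators-at-parallelism-400 breakdown is an assumption chosen to match the numbers in the log):

{code}
// Hypothetical illustration, not Flink code: under slot sharing, every
// subtask occupies its own *logical* slot, but subtasks of different
// operators may share one *physical* TaskManager slot. So a scheduler
// message like "Slots required: 800" can coexist with a cluster that
// only needs max-parallelism physical slots.
public class SlotArithmetic {
    public static void main(String[] args) {
        // Assumed topology: two operators (e.g. source + sink), p=400 each.
        int[] operatorParallelism = {400, 400};

        int logicalSlots = 0;   // one logical slot per subtask
        int physicalSlots = 0;  // shared physical slots actually needed
        for (int p : operatorParallelism) {
            logicalSlots += p;
            physicalSlots = Math.max(physicalSlots, p);
        }
        System.out.println("logical slots required:  " + logicalSlots);  // 800
        System.out.println("physical slots required: " + physicalSlots); // 400
    }
}
{code}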
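For anyone trying to reproduce this, a minimal sketch of a job with the reported settings, assuming the standard DataStream API; the actual topology behind "Custom Source" is not in the ticket, so the source and sink below are placeholders:

{code}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;

public class Flink9010Sketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(1000); // checkpoint interval of 1000 ms, as reported
        env.setParallelism(400);       // 400 parallel subtasks per operator

        env.generateSequence(0, 1_000_000_000L)  // placeholder for the unknown "Custom Source"
           .addSink(new DiscardingSink<>());     // placeholder sink

        env.execute("FLINK-9010 sketch");
    }
}
{code}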