[ https://issues.apache.org/jira/browse/FLINK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419245#comment-16419245 ]
Nico Kruber commented on FLINK-9010:
------------------------------------

Sorry for the late response.
* The distinction between "logical slots" and "physical slots" is confusing, and I somehow doubt that it is documented (but maybe I'm wrong here). We should probably at least adapt the log messages to something that users expect when they read "slots".
* Regarding the actual issue: there should always have been enough machines to offer all needed slots, but I cannot rule out that some EC2 instance was unresponsive or restarting due to a failure. I think [~pnowojski] did some more experiments with big cluster setups recently, and I haven't heard of this occurring again (correct me if I'm wrong). If it did not occur again, we may close this issue as a possible EC2 hiccup.

> NoResourceAvailableException with FLIP-6
> -----------------------------------------
>
>                 Key: FLINK-9010
>                 URL: https://issues.apache.org/jira/browse/FLINK-9010
>             Project: Flink
>          Issue Type: Bug
>          Components: ResourceManager
>    Affects Versions: 1.5.0
>            Reporter: Nico Kruber
>            Assignee: Nico Kruber
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> I was trying to run a bigger program with 400 slots (100 TMs, 2 slots each) with FLIP-6 mode and a checkpointing interval of 1000 and got the following exception:
> {code}
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_1521038088305_0257_01_000101 - Remaining pending container requests: 302
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - TaskExecutor container_1521038088305_0257_01_000101 will be started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory limit 3072 MB
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,155 INFO  org.apache.flink.yarn.Utils  - Copying from file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_000001/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml to hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
> 2018-03-16 03:41:20,165 INFO  org.apache.flink.yarn.YarnResourceManager  - Prepared local resource for modified yaml: resource { scheme: "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: "/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml" } size: 595 timestamp: 1521171680164 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager  - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager  - Starting TaskManagers with command: $JAVA_HOME/bin/java -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m -Dlog.file=<LOG_DIR>/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> <LOG_DIR>/taskmanager.out 2> <LOG_DIR>/taskmanager.err
> 2018-03-16 03:41:20,176 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : ip-172-31-3-221.eu-west-1.compute.internal:8041
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - Received new container: container_1521038088305_0257_01_000102 - Remaining pending container requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - TaskExecutor container_1521038088305_0257_01_000102 will be started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager  - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils  - Copying from file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_000001/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml to hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
> 2018-03-16 03:41:20,190 INFO  org.apache.flink.yarn.YarnResourceManager  - Prepared local resource for modified yaml: resource { scheme: "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: "/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml" } size: 595 timestamp: 1521171680190 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager  - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager  - Starting TaskManagers with command: $JAVA_HOME/bin/java -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m -Dlog.file=<LOG_DIR>/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> <LOG_DIR>/taskmanager.out 2> <LOG_DIR>/taskmanager.err
> 2018-03-16 03:41:20,203 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : ip-172-31-1-233.eu-west-1.compute.internal:8041
> 2018-03-16 03:41:20,713 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager 5fb7473a7738ef09e2c1fe8c5fc46e1e at the SlotManager.
> 2018-03-16 03:41:20,938 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Checkpoint triggering task Source: Custom Source (1/400) is not being executed at the moment. Aborting checkpoint.
> 2018-03-16 03:41:21,611 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager a078410d60d99351c0f54691c0beb5ed at the SlotManager.
> 2018-03-16 03:41:21,938 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Checkpoint triggering task Source: Custom Source (1/400) is not being executed at the moment. Aborting checkpoint.
> 2018-03-16 03:41:21,972 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager 6980e6ba9ce4945c7b2e0ede5130c7dc at the SlotManager.
> 2018-03-16 03:41:22,938 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Checkpoint triggering task Source: Custom Source (1/400) is not being executed at the moment. Aborting checkpoint.
> 2018-03-16 03:41:23,938 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Checkpoint triggering task Source: Custom Source (1/400) is not being executed at the moment. Aborting checkpoint.
> 2018-03-16 03:41:24,882 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager f7401aa710e890b811de8e415f34a61b at the SlotManager.
> 2018-03-16 03:41:24,883 INFO  org.apache.flink.yarn.YarnResourceManager  - Replacing old instance of worker for ResourceID container_1521038088305_0257_01_000041
> 2018-03-16 03:41:24,883 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Unregister TaskManager f7401aa710e890b811de8e415f34a61b from the SlotManager.
> 2018-03-16 03:41:24,883 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager e89e020bc7ebccf07849e326b08b6b73 at the SlotManager.
> 2018-03-16 03:41:24,884 INFO  org.apache.flink.yarn.YarnResourceManager  - The target with resource ID container_1521038088305_0257_01_000041 is already been monitored.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 301.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 302.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 303.
> 2018-03-16 03:41:24,885 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 304.
> 2018-03-16 03:41:24,937 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 305.
> 2018-03-16 03:41:24,938 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 306.
> 2018-03-16 03:41:24,938 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Checkpoint triggering task Source: Custom Source (1/400) is not being executed at the moment. Aborting checkpoint.
> 2018-03-16 03:41:24,938 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 307.
> 2018-03-16 03:41:24,939 INFO  org.apache.flink.yarn.YarnResourceManager  - Requesting new TaskExecutor container with resources <memory:8192, vCores:16>. Number pending requests 308.
> 2018-03-16 03:41:25,255 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register TaskManager a075d154164bab5500f42a0aad7312ad at the SlotManager.
> ...
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 800, slots allocated: 792
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$2(ExecutionGraph.java:997)
> 	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
> 	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> 	at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:517)
> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> 	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> 	at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:755)
> 	at akka.dispatch.OnComplete.internal(Future.scala:258)
> 	at akka.dispatch.OnComplete.internal(Future.scala:256)
> 	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> 	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 	at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> 	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> 	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> 	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
> 	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
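A side note on the "logical" vs. "physical" slot confusion mentioned above: the exception's "Slots required: 800" presumably counts logical slots (one per subtask), while the cluster only has to offer enough physical TaskManager slots to cover the maximum operator parallelism, since subtasks of different operators can share a physical slot. Here is a small standalone sketch of that arithmetic (plain Java, not Flink's implementation; the two-operators-at-parallelism-400 breakdown is an assumption chosen to match the numbers in the log):

{code}
// Hypothetical illustration, not Flink code: under slot sharing, every
// subtask occupies its own *logical* slot, but subtasks of different
// operators may share one *physical* TaskManager slot. So a scheduler
// message like "Slots required: 800" can coexist with a cluster that
// only needs max-parallelism physical slots.
public class SlotArithmetic {
    public static void main(String[] args) {
        // Assumed topology: two operators (e.g. source + sink), p=400 each.
        int[] operatorParallelism = {400, 400};

        int logicalSlots = 0;   // one logical slot per subtask
        int physicalSlots = 0;  // shared physical slots actually needed
        for (int p : operatorParallelism) {
            logicalSlots += p;
            physicalSlots = Math.max(physicalSlots, p);
        }
        System.out.println("logical slots required:  " + logicalSlots);  // 800
        System.out.println("physical slots required: " + physicalSlots); // 400
    }
}
{code}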
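For anyone trying to reproduce this, a minimal sketch of a job with the reported settings, assuming the standard DataStream API; the actual topology behind "Custom Source" is not in the ticket, so the source and sink below are placeholders:

{code}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;

public class Flink9010Sketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(1000); // checkpoint interval of 1000 ms, as reported
        env.setParallelism(400);       // 400 parallel subtasks per operator

        env.generateSequence(0, 1_000_000_000L)  // placeholder for the unknown "Custom Source"
           .addSink(new DiscardingSink<>());     // placeholder sink

        env.execute("FLINK-9010 sketch");
    }
}
{code}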