I got these logs from one of the Yarn containers. Not sure what changed in 1.6.0; I couldn't find anything relevant in the release notes. Looking through the code, I am not sure why the JVM heap size is < 8 GB. We start the TM with 20 GB, so with the cutoff we should have totalJavaMemorySizeMB = 20 GB - 5 GB, i.e. 15 GB, which is greater than 8 GB.
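For what it's worth, here is my reconstruction of the comparison that trips (a sketch, not Flink's actual code; the numbers are taken straight from the error message below):

```shell
# Values from the IllegalConfigurationException below:
network_min=8000000000   # taskmanager.network.memory.min
max_jvm_heap=7769948160  # "maximum JVM heap size" detected at TM startup (~7.2 GiB)

# The TM refuses to start when the configured minimum network buffer
# memory is not strictly smaller than the maximum JVM heap size.
if [ "$network_min" -ge "$max_jvm_heap" ]; then
  echo "Network buffer memory size too large: $network_min >= $max_jvm_heap"
fi
```

So whatever heap the TM actually ends up with (~7.77e9 bytes here, not the 15 GB I expected from the cutoff arithmetic), our configured 8 GB minimum trips the check.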
2018-09-17 16:06:13,728 ERROR org.apache.flink.yarn.YarnTaskExecutorRunner - YARN TaskManager initialization failed.
org.apache.flink.configuration.IllegalConfigurationException: Invalid configuration value for (taskmanager.network.memory.fraction, taskmanager.network.memory.min, taskmanager.network.memory.max): (0.1, 8000000000, 12000000000) - Network buffer memory size too large: 8000000000 >= 7769948160 (maximum JVM heap size)

Please also see my questions above.

Cheers,

On Mon, Sep 17, 2018 at 12:19 PM, Subramanya Suresh <ssur...@salesforce.com> wrote:

> Thanks Till,
>
> "That's also the reason why you don't see registered TMs without a running job."
>
> I am not sure what you mean. We see 0 TMs in Flink (attached earlier, and also in the TaskManagers link) despite running/submitting the job (the RM seems to show a lot of containers though, attached).
>
> Also, I am not sure where to get the logs from without seeing a running TM/container.
>
> How do I restrict the number of containers/cores per container? It seems like -ytm is just a suggestion. I assume parallelism is within the realm of a single container, so would I use 5 to say I want 5 cores within one TM? Is that again only a suggestion?
> I see maxParallelism (set in code only), but that could be 8 if the parallelism I specify is 5.
>
> Sincerely,
>
> On Mon, Sep 17, 2018 at 1:01 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> With Flink 1.6.0 it is no longer needed to specify the number of started containers (-yn 145). Flink will dynamically allocate containers. That's also the reason why you don't see registered TMs without a running job.
>> Moreover, it is recommended to start every container with a single slot (no -ys 5). The parallelism should be controlled via the -p option or by the default parallelism configured in flink-conf.yaml.
>>
>> The log snippet says that Flink started the TaskManagers.
>> But it seems as if they could not register at the ResourceManager, or could never be started. Could you check the TM logs to see what they say? If there is nothing suspicious, then it would be helpful if you could share the complete logs with us.
>>
>> Cheers,
>> Till
>>
>> On Mon, Sep 17, 2018 at 9:16 AM Subramanya Suresh <ssur...@salesforce.com> wrote:
>>
>>> Hi,
>>> It was suggested here that we migrate to 1.6.0 because of the Akka/TM-lost issues we were facing with 1.4.2. I got our Yarn cluster set up and launched our job with the command mentioned below.
>>>
>>> Symptoms:
>>>
>>> - The CLI logs say the job is submitted, but the Yarn ResourceManager says only 1 container is allocated; that goes up on refresh, and then a subsequent refresh shows it back at 1 container allocated.
>>> - The UI consistently shows 0 TMs and 0 slots (see attached).
>>> - The exceptions in the UI show the NoResourceAvailableException below.
>>> - Also see the JobManager logs below.
>>>
>>> So, not sure what gives? I was able to launch the same job in 1.4.2, immediately get the mentioned TMs, and have the job working as it should.
>>>
>>> *Job Submit Parameters:*
>>>
>>> nohup $FLINK_BINARY run \
>>>   -m yarn-cluster \
>>>   -c $FLINK_JOB_CLASSNAME \
>>>   -yst \
>>>   -ys 5 \
>>>   -yn 145 \
>>>   -yjm 20000 \
>>>   -ytm 20000 \
>>>   -ynm $YARN_APPLICATION_NAME \
>>>   -d $FLINK_JOB_JAR \
>>>   > $FLINK_JOB_LOGS/stdout.log \
>>>   2> $FLINK_JOB_LOGS/stderr.log \
>>>   & echo $! > $FLINK_JOB_LOGS/current-run.pid
>>>
>>> *Exception:*
>>>
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>> Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0
>>>
>>> *Yarn JobManager Logs:*
>>>
>>> 2018-09-17 06:53:18,041 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://fl...@hello-world4-30-crz.ops.sfdc.net:41135/user/resourcemanager(9a62f56ce988f5499dbe1d09bd894b8a)
>>> 2018-09-17 06:53:18,045 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
>>> 2018-09-17 06:53:18,046 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e1678524024c0d8e7f18b917ad854418}]
>>> 2018-09-17 06:53:18,046 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
>>> 2018-09-17 06:53:18,048 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/31462809fd71ae1c92a11a58dd2f4d24/job_manager_lock.
>>> 2018-09-17 06:53:18,048 INFO org.apache.flink.yarn.YarnResourceManager - Registering job manager 8a7f0e49aa68e867ef8f058c46414dd...@akka.tcp://fl...@hello-world4-30-crz.ops.sfdc.net:41135/user/jobmanager_0 for job 31462809fd71ae1c92a11a58dd2f4d24.
>>> 2018-09-17 06:53:18,060 INFO org.apache.flink.yarn.YarnResourceManager - Registered job manager 8a7f0e49aa68e867ef8f058c46414dd...@akka.tcp://fl...@hello-world4-30-crz.ops.sfdc.net:41135/user/jobmanager_0 for job 31462809fd71ae1c92a11a58dd2f4d24.
>>> 2018-09-17 06:53:18,062 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: 9a62f56ce988f5499dbe1d09bd894b8a.
>>> 2018-09-17 06:53:18,062 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{e1678524024c0d8e7f18b917ad854418}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
>>> 2018-09-17 06:53:18,064 INFO org.apache.flink.yarn.YarnResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 31462809fd71ae1c92a11a58dd2f4d24 with allocation id AllocationID{8976aac24593aa0d9854fdb569c1d0ac}.
>>> 2018-09-17 06:53:18,071 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20000, vCores:5>. Number pending requests 1.
>>> 2018-09-17 06:53:23,191 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000005 - Remaining pending container requests: 1
>>> 2018-09-17 06:53:23,602 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>>> 2018-09-17 06:53:23,603 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>>> 2018-09-17 06:53:34,193 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>>> 2018-09-17 06:53:39,696 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000006 - Remaining pending container requests: 1
>>> 2018-09-17 06:53:40,269 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>>> 2018-09-17 06:53:40,270 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>>> 2018-09-17 06:53:45,703 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>>> 2018-09-17 06:53:51,209 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000007 - Remaining pending container requests: 1
>>> 2018-09-17 06:53:51,365 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>>> 2018-09-17 06:53:51,365 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>>> 2018-09-17 06:53:51,383 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000009 - Remaining pending container requests: 0
>>> 2018-09-17 06:53:51,385 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000009.
>>> 2018-09-17 06:54:01,714 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>>> 2018-09-17 06:54:07,217 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000011 - Remaining pending container requests: 1
>>> 2018-09-17 06:54:07,263 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>>> 2018-09-17 06:54:07,266 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>>> 2018-09-17 06:54:07,276 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000012 - Remaining pending container requests: 0
>>> 2018-09-17 06:54:07,276 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000012.
>>> 2018-09-17 06:54:07,276 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000013 - Remaining pending container requests: 0
>>> 2018-09-17 06:54:07,276 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000013.
>>> 2018-09-17 06:54:12,720 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>>> 2018-09-17 06:54:18,221 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000016 - Remaining pending container requests: 1
>>> 2018-09-17 06:54:18,256 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>>> 2018-09-17 06:54:18,257 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000017 - Remaining pending container requests: 0
>>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000017.
>>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000018 - Remaining pending container requests: 0
>>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000018.
>>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000020 - Remaining pending container requests: 0
>>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000020.
>>> 2018-09-17 06:54:28,726 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>>> 2018-09-17 06:54:34,229 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000021 - Remaining pending container requests: 1
>>> 2018-09-17 06:54:34,268 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>>> 2018-09-17 06:54:34,269 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000022 - Remaining pending container requests: 0
>>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000022.
>>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000024 - Remaining pending container requests: 0
>>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000024.
>>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000025 - Remaining pending container requests: 0
>>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000025.
>>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000028 - Remaining pending container requests: 0
>>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000028.
>>> 2018-09-17 06:54:39,731 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>>> 2018-09-17 06:54:45,236 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000042 - Remaining pending container requests: 1
>>> 2018-09-17 06:54:45,281 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>>> 2018-09-17 06:54:45,282 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>>>
>>>
>>>
>>> 2018-09-17 06:58:08,291 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000595.
>>> 2018-09-17 06:58:13,403 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>>> 2018-09-17 06:58:18,045 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Pending slot request [SlotRequestId{e1678524024c0d8e7f18b917ad854418}] timed out.
>>> 2018-09-17 06:58:18,047 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job streaming-searches-test (31462809fd71ae1c92a11a58dd2f4d24) switched from state RUNNING to FAILING.
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0
>>> at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:984)
>>> at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>> at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>>> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>> at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>> at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:534)
>>> at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>> at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>> at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>> at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
>>> at akka.dispatch.OnComplete.internal(Future.scala:258)
>>> at akka.dispatch.OnComplete.internal(Future.scala:256)
>>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>>
>>> Sincerely,
>>>
>>> --
>>> <http://smart.salesforce.com/sig/ssuresh//us_mb/default/link.html>
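For reference, a 1.6-style submission along the lines Till describes (no -yn/-ys; Flink allocates containers dynamically, with parallelism driven by -p) might look roughly like the sketch below. It reuses the variables from the original command; the -p value of 725 simply mirrors the old 145 containers x 5 slots and is illustrative only:

```shell
# Sketch of a Flink 1.6-style per-job submission: drop -yn/-ys and let the
# requested parallelism determine how many TaskManagers get started.
nohup $FLINK_BINARY run \
  -m yarn-cluster \
  -c $FLINK_JOB_CLASSNAME \
  -yjm 20000 \
  -ytm 20000 \
  -ynm $YARN_APPLICATION_NAME \
  -p 725 \
  -d $FLINK_JOB_JAR \
  > $FLINK_JOB_LOGS/stdout.log \
  2> $FLINK_JOB_LOGS/stderr.log \
  & echo $! > $FLINK_JOB_LOGS/current-run.pid
```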