Thanks Till,

"That's also the reason why you don't registered TMs without a running job."

I am not sure what you mean. We see 0 TMs in Flink (attached earlier and also in the TaskManagers link) despite running/submitting the job (the RM seems to show a lot of containers though, attached).

I am also not sure where to get the logs from without seeing a running TM/container.

How do I restrict the number of containers and the cores per container? It seems like -ytm is just a suggestion. I assume parallelism is within the realm of a single container, so I would use 5 to say I want 5 cores within one TM? Is that again only a suggestion? I see maxParallelism (set in code only), but that could be 8 even if the parallelism I specify is 5.

Sincerely,

On Mon, Sep 17, 2018 at 1:01 AM, Till Rohrmann <trohrm...@apache.org> wrote:

> With Flink 1.6.0 it is no longer needed to specify the number of started
> containers (-yn 145). Flink will dynamically allocate containers. That's
> also the reason why you don't registered TMs without a running job.
> Moreover it is recommended to start every container with a single slot (no
> -ys 5). The parallelism should be controlled via the -p option or by the
> default parallelism configured in flink-conf.yaml.
>
> The log snippet says that Flink started the TaskManagers. But it seems as
> if they could not register at the ResourceManager or could never be started.
> Could you check the TM logs to see what they say. If there is nothing
> suspicious, then it would be helpful if you could share the complete logs
> with us.
>
> Cheers,
> Till
>
> On Mon, Sep 17, 2018 at 9:16 AM Subramanya Suresh <ssur...@salesforce.com>
> wrote:
>
>> Hi,
>> It was suggested here that we migrate to 1.6.0 in light of the Akka/TM lost
>> issues we were facing with 1.4.2. I got our Yarn cluster set up and launched
>> our job with the command mentioned below.
>>
>> Symptoms:
>>
>> - The CLI logs say the job is submitted, but the Yarn ResourceManager shows
>> only 1 container allocated; the count goes up on refresh, and a subsequent
>> refresh shows it back at 1 container allocated.
>> - The UI consistently shows 0 TMs and 0 Slots (see attached).
>> - The exceptions in the UI show the NoResourceAvailableException below.
>> - Also see the JobManager logs below.
>>
>> So I am not sure what gives. I was able to launch the same job in 1.4.2 and
>> immediately get the mentioned TMs and have the job working as it should.
>>
>> *Job Submit Parameters:*
>> nohup $FLINK_BINARY run \
>> -m yarn-cluster \
>> -c $FLINK_JOB_CLASSNAME \
>> -yst \
>> -ys 5 \
>> -yn 145 \
>> -yjm 20000 \
>> -ytm 20000 \
>> -ynm $YARN_APPLICATION_NAME \
>> -d $FLINK_JOB_JAR \
>> > $FLINK_JOB_LOGS/stdout.log \
>> 2> $FLINK_JOB_LOGS/stderr.log \
>> & echo $! > $FLINK_JOB_LOGS/current-run.pid
>>
>> *Exception:*
>>
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0
>>
>> *Yarn JobManager Logs:*
>>
>> 2018-09-17 06:53:18,041 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://fl...@hello-world4-30-crz.ops.sfdc.net:41135/user/resourcemanager(9a62f56ce988f5499dbe1d09bd894b8a)
>> 2018-09-17 06:53:18,045 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
>> 2018-09-17 06:53:18,046 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e1678524024c0d8e7f18b917ad854418}]
>> 2018-09-17 06:53:18,046 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
>> 2018-09-17 06:53:18,048 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/31462809fd71ae1c92a11a58dd2f4d24/job_manager_lock.
>> 2018-09-17 06:53:18,048 INFO org.apache.flink.yarn.YarnResourceManager - Registering job manager 8a7f0e49aa68e867ef8f058c46414dd...@akka.tcp://fl...@hello-world4-30-crz.ops.sfdc.net:41135/user/jobmanager_0 for job 31462809fd71ae1c92a11a58dd2f4d24.
>> 2018-09-17 06:53:18,060 INFO org.apache.flink.yarn.YarnResourceManager - Registered job manager 8a7f0e49aa68e867ef8f058c46414dd...@akka.tcp://fl...@hello-world4-30-crz.ops.sfdc.net:41135/user/jobmanager_0 for job 31462809fd71ae1c92a11a58dd2f4d24.
>> 2018-09-17 06:53:18,062 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: 9a62f56ce988f5499dbe1d09bd894b8a.
>> 2018-09-17 06:53:18,062 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{e1678524024c0d8e7f18b917ad854418}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
>> 2018-09-17 06:53:18,064 INFO org.apache.flink.yarn.YarnResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 31462809fd71ae1c92a11a58dd2f4d24 with allocation id AllocationID{8976aac24593aa0d9854fdb569c1d0ac}.
>> 2018-09-17 06:53:18,071 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20000, vCores:5>. Number pending requests 1.
>> 2018-09-17 06:53:23,191 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000005 - Remaining pending container requests: 1
>> 2018-09-17 06:53:23,602 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>> 2018-09-17 06:53:23,603 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>> 2018-09-17 06:53:34,193 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>> 2018-09-17 06:53:39,696 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000006 - Remaining pending container requests: 1
>> 2018-09-17 06:53:40,269 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>> 2018-09-17 06:53:40,270 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>> 2018-09-17 06:53:45,703 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>> 2018-09-17 06:53:51,209 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000007 - Remaining pending container requests: 1
>> 2018-09-17 06:53:51,365 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>> 2018-09-17 06:53:51,365 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>> 2018-09-17 06:53:51,383 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000009 - Remaining pending container requests: 0
>> 2018-09-17 06:53:51,385 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000009.
>> 2018-09-17 06:54:01,714 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>> 2018-09-17 06:54:07,217 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000011 - Remaining pending container requests: 1
>> 2018-09-17 06:54:07,263 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>> 2018-09-17 06:54:07,266 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>> 2018-09-17 06:54:07,276 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000012 - Remaining pending container requests: 0
>> 2018-09-17 06:54:07,276 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000012.
>> 2018-09-17 06:54:07,276 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000013 - Remaining pending container requests: 0
>> 2018-09-17 06:54:07,276 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000013.
>> 2018-09-17 06:54:12,720 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>> 2018-09-17 06:54:18,221 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000016 - Remaining pending container requests: 1
>> 2018-09-17 06:54:18,256 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>> 2018-09-17 06:54:18,257 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000017 - Remaining pending container requests: 0
>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000017.
>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000018 - Remaining pending container requests: 0
>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000018.
>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000020 - Remaining pending container requests: 0
>> 2018-09-17 06:54:18,282 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000020.
>> 2018-09-17 06:54:28,726 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>> 2018-09-17 06:54:34,229 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000021 - Remaining pending container requests: 1
>> 2018-09-17 06:54:34,268 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>> 2018-09-17 06:54:34,269 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000022 - Remaining pending container requests: 0
>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000022.
>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000024 - Remaining pending container requests: 0
>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000024.
>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000025 - Remaining pending container requests: 0
>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000025.
>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000028 - Remaining pending container requests: 0
>> 2018-09-17 06:54:34,285 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000028.
>> 2018-09-17 06:54:39,731 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>> 2018-09-17 06:54:45,236 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e31_1536964973951_0247_01_000042 - Remaining pending container requests: 1
>> 2018-09-17 06:54:45,281 INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
>> 2018-09-17 06:54:45,282 INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
>>
>>
>> 2018-09-17 06:58:08,291 INFO org.apache.flink.yarn.YarnResourceManager - Returning excess container container_e31_1536964973951_0247_01_000595.
>> 2018-09-17 06:58:13,403 INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:20480, vCores:5>. Number pending requests 1.
>> 2018-09-17 06:58:18,045 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Pending slot request [SlotRequestId{e1678524024c0d8e7f18b917ad854418}] timed out.
>> 2018-09-17 06:58:18,047 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job streaming-searches-test (31462809fd71ae1c92a11a58dd2f4d24) switched from state RUNNING to FAILING.
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0
>> at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:984)
>> at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>> at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>> at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>> at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:534)
>> at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>> at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>> at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>> at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
>> at akka.dispatch.OnComplete.internal(Future.scala:258)
>> at akka.dispatch.OnComplete.internal(Future.scala:256)
>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>
>> Sincerely,
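
For reference, here is a minimal sketch of how the submit command might look with Till's suggestions applied: -yn 145 and -ys 5 are dropped and the parallelism is passed via -p. It assumes the same environment variables as the original submit script. PARALLELISM and its value of 5 are hypothetical and only there for illustration. The sketch prints the command instead of running it, and it is not a verified fix for the 0 TMs symptom.

```shell
#!/bin/sh
# Sketch only. FLINK_BINARY, FLINK_JOB_CLASSNAME, YARN_APPLICATION_NAME and
# FLINK_JOB_JAR are the variables used by the original submit script.
# PARALLELISM is a hypothetical name and value chosen for illustration.
PARALLELISM=5

# -yn 145 and -ys 5 are omitted, following Till's reply: Flink 1.6.0 allocates
# containers dynamically and one slot per container is recommended.
# -p sets the job parallelism instead.
set -- run \
  -m yarn-cluster \
  -p "$PARALLELISM" \
  -c "$FLINK_JOB_CLASSNAME" \
  -yst \
  -yjm 20000 \
  -ytm 20000 \
  -ynm "$YARN_APPLICATION_NAME" \
  -d "$FLINK_JOB_JAR"

# Print the assembled command for review instead of executing it.
echo "$FLINK_BINARY" "$@"
```

Once the printed command looks right, the echo could be replaced with the nohup invocation and log redirects from the original script. As Till notes, the default parallelism configured in flink-conf.yaml is an alternative to passing -p on every submit.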