Hi Subramanya,

You can get the logs from Yarn if you enabled log aggregation. If they do
not contain any TM logs, then the TMs were not started.

If Yarn started containers but you don't see them connected to Flink's
ResourceManager, then the TaskManagers either did not start up or they have
problems connecting to the ResourceManager. In order to debug this problem,
the logs would be helpful.

You can configure the cores per container by setting
`yarn.containers.vcores` in your flink-conf.yaml. If this value is not
specified, then it will use the number of slots per TM.
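
As a rough illustration of that fallback (a sketch of the rule described above, not Flink's actual code; the function name and the dictionary standing in for flink-conf.yaml are made up for this example):

```python
def container_vcores(flink_conf: dict, slots_per_tm: int) -> int:
    """Return the vcores requested per container under the rule above."""
    # Hypothetical helper: an explicit yarn.containers.vcores wins,
    # otherwise fall back to the number of slots per TaskManager.
    value = flink_conf.get("yarn.containers.vcores")
    return int(value) if value is not None else slots_per_tm


# No explicit setting and -ys 5: the slot count is used.
print(container_vcores({}, slots_per_tm=5))
# Explicit setting: it takes precedence over the slot count.
print(container_vcores({"yarn.containers.vcores": 1}, slots_per_tm=5))
```

This is consistent with the `<memory:20480, vCores:5>` requests in your JobManager log, where `-ys 5` was set.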

In order to debug the memory settings problem it would be helpful to either
get the full logs or the configuration and the command with which you
started the Flink cluster. From the log snippet it looks as if Flink only
got 8GB of memory assigned.
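
To make the numbers in the error message easier to compare, here is a small sketch that converts them to MiB and repeats the comparison. The two values are copied from your log; the function name is hypothetical, and this is not Flink's implementation of the check.

```python
MIB = 1024 * 1024

# Values copied from the error message in the quoted log.
network_memory_min_bytes = 8_000_000_000
max_jvm_heap_bytes = 7_769_948_160


def network_memory_too_large(network_bytes: int, heap_bytes: int) -> bool:
    """Mirror the comparison printed in the error: network size >= heap size."""
    return network_bytes >= heap_bytes


print(f"network min : {network_memory_min_bytes / MIB:.0f} MiB")
print(f"max JVM heap: {max_jvm_heap_bytes / MIB:.0f} MiB")
print(network_memory_too_large(network_memory_min_bytes, max_jvm_heap_bytes))
```

The reported maximum heap works out to 7410 MiB, which is well below the 15GB you expected from 20GB minus the cutoff.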

Cheers,
Till

On Mon, Sep 17, 2018 at 11:34 PM Subramanya Suresh <ssur...@salesforce.com>
wrote:

> I got these logs from one of the Yarn logs. Not sure what changed in
> 1.6.0, couldn't find anything relevant in the release notes.
> Looking through the code, I am not sure the JVM heap size is < 8GB. We
> start the TM with 20GB, so with the cutoff we should have
> totalJavaMemorySizeMB = 20GB - 5GB, i.e. 15GB, which is greater than 8GB.
>
> 2018-09-17 16:06:13,728 ERROR
> org.apache.flink.yarn.YarnTaskExecutorRunner                  - YARN
> TaskManager initialization failed.
> org.apache.flink.configuration.IllegalConfigurationException: Invalid
> configuration value for (taskmanager.network.memory.fraction,
> taskmanager.network.memory.min, taskmanager.network.memory.max) : (0.1,
> 8000000000, 12000000000) - Network buffer memory size too large: 8000000000
> >= 7769948160(maximum JVM heap size)
>
> Please also see my questions above.
>
> Cheers,
>
> On Mon, Sep 17, 2018 at 12:19 PM, Subramanya Suresh <
> ssur...@salesforce.com> wrote:
>
>> Thanks Till,
>>
>> "That's also the reason why you don't see registered TMs without a running
>> job."
>> > I am not sure what you mean. We see 0 TMs in Flink (attached earlier
>> and also in the TaskManagers link) despite running/submitting the job (the
>> RM seems to show a lot of containers though, attached).
>> > Also, I am not sure where I would get the logs from without seeing a
>> running TM/container.
>>
>> How do I restrict the number of containers/cores per container? It seems
>> like -ytm is just a suggestion. I assume parallelism is within the realm
>> of a single container, so I would use 5 to say I want 5 cores within one
>> TM? Is that again only a suggestion?
>> I see maxParallelism (set in code only), but that could be 8 if the
>> parallelism I specify is 5.
>>
>> Sincerely,
>>
>> On Mon, Sep 17, 2018 at 1:01 AM, Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> With Flink 1.6.0 it is no longer necessary to specify the number of started
>>> containers (-yn 145). Flink will dynamically allocate containers. That's
>>> also the reason why you don't see registered TMs without a running job.
>>> Moreover, it is recommended to start every container with a single slot (no
>>> -ys 5). The parallelism should be controlled via the -p option or by the
>>> default parallelism configured in flink-conf.yaml.
>>>
>>> The log snippet says that Flink started the TaskManagers. But it seems
>>> as if they could not register at the ResourceManager or could never be
>>> started. Could you check the TM logs to see what they say? If there is
>>> nothing suspicious, then it would be helpful if you could share the
>>> complete logs with us.
>>>
>>> Cheers,
>>> Till
>>>
>>>
>>>
>>> On Mon, Sep 17, 2018 at 9:16 AM Subramanya Suresh <
>>> ssur...@salesforce.com> wrote:
>>>
>>>> Hi,
>>>> We were advised here to migrate to 1.6.0 because of the Akka/TM lost issues
>>>> we were facing with 1.4.2. I got our Yarn cluster set up and launched our
>>>> job with the command mentioned below.
>>>>
>>>> Symptoms:
>>>>
>>>>    - The CLI logs say the job is submitted, but the Yarn ResourceManager
>>>>    shows only 1 container allocated; that number goes up on refresh, and a
>>>>    subsequent refresh shows it back at 1 container allocated.
>>>>    - The UI consistently shows 0 TMs and 0 slots (see attached).
>>>>    - The exceptions in the UI show the NoResourceAvailableException
>>>>    below.
>>>>    - Also see the JobManager logs below.
>>>>
>>>> So I am not sure what gives. I was able to launch the same job in 1.4.2 and
>>>> immediately get the mentioned TMs and have the job working as it should.
>>>>
>>>>
>>>>
>>>> *Job Submit Parameters:*
>>>> nohup $FLINK_BINARY run \
>>>>     -m yarn-cluster \
>>>>     -c $FLINK_JOB_CLASSNAME \
>>>>     -yst \
>>>>     -ys 5 \
>>>>     -yn 145 \
>>>>     -yjm 20000 \
>>>>     -ytm 20000 \
>>>>     -ynm $YARN_APPLICATION_NAME \
>>>>     -d $FLINK_JOB_JAR \
>>>>             > $FLINK_JOB_LOGS/stdout.log \
>>>>             2> $FLINK_JOB_LOGS/stderr.log \
>>>>             & echo $! > $FLINK_JOB_LOGS/current-run.pid
>>>>
>>>> *Exception:*
>>>>
>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>>  Could not allocate all requires slots within timeout of 300000 ms. Slots 
>>>> required: 2, slots allocated: 0
>>>>
>>>>
>>>> *Yarn JobManager Logs:*
>>>>
>>>> 2018-09-17 06:53:18,041 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Connecting
>>>> to ResourceManager akka.tcp://
>>>> fl...@hello-world4-30-crz.ops.sfdc.net:41135/user/resourcemanager(9a62f56ce988f5499dbe1d09bd894b8a)
>>>> 2018-09-17 06:53:18,045 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Resolved
>>>> ResourceManager address, beginning registration
>>>> 2018-09-17 06:53:18,046 INFO
>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Cannot
>>>> serve slot request, no ResourceManager connected. Adding as pending request
>>>> [SlotRequestId{e1678524024c0d8e7f18b917ad854418}]
>>>> 2018-09-17 06:53:18,046 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>> Registration at ResourceManager attempt 1 (timeout=100ms)
>>>> 2018-09-17 06:53:18,048 INFO
>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>>> Starting ZooKeeperLeaderRetrievalService
>>>> /leader/31462809fd71ae1c92a11a58dd2f4d24/job_manager_lock.
>>>> 2018-09-17 06:53:18,048 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Registering
>>>> job manager 8a7f0e49aa68e867ef8f058c46414...@akka.tcp://
>>>> fl...@hello-world4-30-crz.ops.sfdc.net:41135/user/jobmanager_0 for job
>>>> 31462809fd71ae1c92a11a58dd2f4d24.
>>>> 2018-09-17 06:53:18,060 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Registered
>>>> job manager 8a7f0e49aa68e867ef8f058c46414...@akka.tcp://
>>>> fl...@hello-world4-30-crz.ops.sfdc.net:41135/user/jobmanager_0 for job
>>>> 31462809fd71ae1c92a11a58dd2f4d24.
>>>> 2018-09-17 06:53:18,062 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster                  - JobManager
>>>> successfully registered at ResourceManager, leader id:
>>>> 9a62f56ce988f5499dbe1d09bd894b8a.
>>>> 2018-09-17 06:53:18,062 INFO
>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Requesting
>>>> new slot [SlotRequestId{e1678524024c0d8e7f18b917ad854418}] and profile
>>>> ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0,
>>>> nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
>>>> 2018-09-17 06:53:18,064 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Request
>>>> slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1,
>>>> directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job
>>>> 31462809fd71ae1c92a11a58dd2f4d24 with allocation id
>>>> AllocationID{8976aac24593aa0d9854fdb569c1d0ac}.
>>>> 2018-09-17 06:53:18,071 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Requesting
>>>> new TaskExecutor container with resources <memory:20000, vCores:5>. Number
>>>> pending requests 1.
>>>> 2018-09-17 06:53:23,191 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000005 - Remaining
>>>> pending container requests: 1
>>>> 2018-09-17 06:53:23,602 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Creating
>>>> container launch context for TaskManagers
>>>> 2018-09-17 06:53:23,603 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Starting
>>>> TaskManagers
>>>> 2018-09-17 06:53:34,193 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Requesting
>>>> new TaskExecutor container with resources <memory:20480, vCores:5>. Number
>>>> pending requests 1.
>>>> 2018-09-17 06:53:39,696 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000006 - Remaining
>>>> pending container requests: 1
>>>> 2018-09-17 06:53:40,269 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Creating
>>>> container launch context for TaskManagers
>>>> 2018-09-17 06:53:40,270 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Starting
>>>> TaskManagers
>>>> 2018-09-17 06:53:45,703 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Requesting
>>>> new TaskExecutor container with resources <memory:20480, vCores:5>. Number
>>>> pending requests 1.
>>>> 2018-09-17 06:53:51,209 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000007 - Remaining
>>>> pending container requests: 1
>>>> 2018-09-17 06:53:51,365 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Creating
>>>> container launch context for TaskManagers
>>>> 2018-09-17 06:53:51,365 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Starting
>>>> TaskManagers
>>>> 2018-09-17 06:53:51,383 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000009 - Remaining
>>>> pending container requests: 0
>>>> 2018-09-17 06:53:51,385 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>>> excess container container_e31_1536964973951_0247_01_000009.
>>>> 2018-09-17 06:54:01,714 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Requesting
>>>> new TaskExecutor container with resources <memory:20480, vCores:5>. Number
>>>> pending requests 1.
>>>> 2018-09-17 06:54:07,217 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000011 - Remaining
>>>> pending container requests: 1
>>>> 2018-09-17 06:54:07,263 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Creating
>>>> container launch context for TaskManagers
>>>> 2018-09-17 06:54:07,266 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Starting
>>>> TaskManagers
>>>> 2018-09-17 06:54:07,276 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000012 - Remaining
>>>> pending container requests: 0
>>>> 2018-09-17 06:54:07,276 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>>> excess container container_e31_1536964973951_0247_01_000012.
>>>> 2018-09-17 06:54:07,276 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000013 - Remaining
>>>> pending container requests: 0
>>>> 2018-09-17 06:54:07,276 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>>> excess container container_e31_1536964973951_0247_01_000013.
>>>> 2018-09-17 06:54:12,720 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Requesting
>>>> new TaskExecutor container with resources <memory:20480, vCores:5>. Number
>>>> pending requests 1.
>>>> 2018-09-17 06:54:18,221 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000016 - Remaining
>>>> pending container requests: 1
>>>> 2018-09-17 06:54:18,256 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Creating
>>>> container launch context for TaskManagers
>>>> 2018-09-17 06:54:18,257 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Starting
>>>> TaskManagers
>>>> 2018-09-17 06:54:18,282 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000017 - Remaining
>>>> pending container requests: 0
>>>> 2018-09-17 06:54:18,282 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>>> excess container container_e31_1536964973951_0247_01_000017.
>>>> 2018-09-17 06:54:18,282 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000018 - Remaining
>>>> pending container requests: 0
>>>> 2018-09-17 06:54:18,282 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>>> excess container container_e31_1536964973951_0247_01_000018.
>>>> 2018-09-17 06:54:18,282 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000020 - Remaining
>>>> pending container requests: 0
>>>> 2018-09-17 06:54:18,282 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>>> excess container container_e31_1536964973951_0247_01_000020.
>>>> 2018-09-17 06:54:28,726 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Requesting
>>>> new TaskExecutor container with resources <memory:20480, vCores:5>. Number
>>>> pending requests 1.
>>>> 2018-09-17 06:54:34,229 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000021 - Remaining
>>>> pending container requests: 1
>>>> 2018-09-17 06:54:34,268 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Creating
>>>> container launch context for TaskManagers
>>>> 2018-09-17 06:54:34,269 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Starting
>>>> TaskManagers
>>>> 2018-09-17 06:54:34,285 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000022 - Remaining
>>>> pending container requests: 0
>>>> 2018-09-17 06:54:34,285 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>>> excess container container_e31_1536964973951_0247_01_000022.
>>>> 2018-09-17 06:54:34,285 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000024 - Remaining
>>>> pending container requests: 0
>>>> 2018-09-17 06:54:34,285 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>>> excess container container_e31_1536964973951_0247_01_000024.
>>>> 2018-09-17 06:54:34,285 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000025 - Remaining
>>>> pending container requests: 0
>>>> 2018-09-17 06:54:34,285 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>>> excess container container_e31_1536964973951_0247_01_000025.
>>>> 2018-09-17 06:54:34,285 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000028 - Remaining
>>>> pending container requests: 0
>>>> 2018-09-17 06:54:34,285 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>>> excess container container_e31_1536964973951_0247_01_000028.
>>>> 2018-09-17 06:54:39,731 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Requesting
>>>> new TaskExecutor container with resources <memory:20480, vCores:5>. Number
>>>> pending requests 1.
>>>> 2018-09-17 06:54:45,236 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>>> new container: container_e31_1536964973951_0247_01_000042 - Remaining
>>>> pending container requests: 1
>>>> 2018-09-17 06:54:45,281 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Creating
>>>> container launch context for TaskManagers
>>>> 2018-09-17 06:54:45,282 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Starting
>>>> TaskManagers
>>>>
>>>>
>>>>
>>>>
>>>> 2018-09-17 06:58:08,291 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>>> excess container container_e31_1536964973951_0247_01_000595.
>>>> 2018-09-17 06:58:13,403 INFO
>>>> org.apache.flink.yarn.YarnResourceManager                     - Requesting
>>>> new TaskExecutor container with resources <memory:20480, vCores:5>. Number
>>>> pending requests 1.
>>>> 2018-09-17 06:58:18,045 INFO
>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Pending
>>>> slot request [SlotRequestId{e1678524024c0d8e7f18b917ad854418}] timed out.
>>>> 2018-09-17 06:58:18,047 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>> streaming-searches-test (31462809fd71ae1c92a11a58dd2f4d24) switched from
>>>> state RUNNING to FAILING.
>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>> Could not allocate all requires slots within timeout of 300000 ms. Slots
>>>> required: 2, slots allocated: 0
>>>> at
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:984)
>>>> at
>>>> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>>> at
>>>> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>>>> at
>>>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>> at
>>>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>> at
>>>> org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:534)
>>>> at
>>>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>>> at
>>>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>>> at
>>>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>> at
>>>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>> at
>>>> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
>>>> at akka.dispatch.OnComplete.internal(Future.scala:258)
>>>> at akka.dispatch.OnComplete.internal(Future.scala:256)
>>>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>>>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>>>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>>>
>>>> Sincerely,
>>>>
>>>> --
>>>>
>>>> <http://smart.salesforce.com/sig/ssuresh//us_mb/default/link.html>
>>>>
