Forget my last email. I received the one-time code and could access the logs.

Cheers,
Till

On Sat, Oct 26, 2019 at 6:49 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Regina,
>
> I couldn't access the log files because LockBox asked me to create a new
> password, and now it asks for the one-time code to confirm this change. It
> says it will send the one-time code to my registered email address, which I
> don't have.
>
> Cheers,
> Till
>
> On Fri, Oct 25, 2019 at 10:14 PM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Great, thanks a lot Regina. I'll check the logs tomorrow. If info level
>> is not enough, then I'll let you know.
>>
>> Cheers,
>> Till
>>
>> On Fri, Oct 25, 2019, 21:20 Chan, Regina <regina.c...@gs.com> wrote:
>>
>>> Till, I added you to this LockBox area where you should be able to
>>> download the logs. You should have also received an email about an account
>>> created in LockBox, where you can set a password. Let me know if you have
>>> any issues.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Till Rohrmann <trohrm...@apache.org>
>>> *Sent:* Friday, October 25, 2019 1:24 PM
>>> *To:* Chan, Regina [Engineering] <regina.c...@ny.email.gs.com>
>>> *Cc:* Yang Wang <danrtsey...@gmail.com>; user <user@flink.apache.org>
>>> *Subject:* Re: The RMClient's and YarnResourceManagers internal state
>>> about the number of pending container requests has diverged
>>>
>>>
>>>
>>> Could you provide me with the full logs of the cluster
>>> entrypoint/JobManager? I'd like to see what's going on there.
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Till
>>>
>>>
>>>
>>> On Fri, Oct 25, 2019, 19:10 Chan, Regina <regina.c...@gs.com> wrote:
>>>
>>> Till,
>>>
>>>
>>>
>>> We’re still seeing a large number of returned containers even with this
>>> heartbeat interval set to something higher. Do you have any hints as to
>>> what’s going on? It seems to be bursty in nature. The bursty requests cause
>>> the job to fail with the cluster not having enough resources because it’s
>>> in the process of releasing them.
>>>
>>> “org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>> Could not allocate enough slots to run the job. Please make sure that the
>>> cluster has enough resources.” It causes the job to run very
>>> inconsistently.
>>>
>>>
>>>
>>> Since legacy mode is now gone in 1.9, we don’t really see many options
>>> here.
>>>
>>>
>>>
>>> Run profile                                                            Returned excess containers
>>> 12G per TM, 2 slots, yarn.heartbeat.container-request-interval=500     685
>>> 12G per TM, 2 slots, yarn.heartbeat.container-request-interval=5000    552
>>> 12G per TM, 2 slots, yarn.heartbeat.container-request-interval=10000   331
>>> 10G per TM, 1 slot,  yarn.heartbeat.container-request-interval=60000   478
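>>>
>>> For reference, each run profile corresponds to a flink-conf.yaml along
>>> these lines (a sketch of the third run; the memory key shown is the
>>> pre-1.10 one, so it may not match our exact setup):
>>>
>>> taskmanager.heap.size: 12g
>>> taskmanager.numberOfTaskSlots: 2
>>> yarn.heartbeat.container-request-interval: 10000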
>>>
>>>
>>>
>>> 2019-10-25 09:55:51,452 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying
>>> CHAIN DataSource (synonym | Read Staging From File System | AVRO) -> Map
>>> (Map at readAvroFileWithFilter(FlinkReadUtils.java:78)) -> Map (Key
>>> Extractor) (14/90) (attempt #0) to
>>> container_e22_1571837093169_78279_01_000852 @ d50503-004-e22.dc.gs.com
>>> (dataPort=33579)
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000909 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000909.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000910 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000910.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000911 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000911.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000912 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000912.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000913 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000913.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000914 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000914.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000915 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000915.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000916 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000916.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000917 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000917.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000918 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000918.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000919 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000919.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000920 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000920.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000921 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000921.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000922 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000922.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000923 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000923.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000924 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000924.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000925 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000925.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000926 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000926.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000927 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000927.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000928 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000928.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000929 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000929.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000930 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000930.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000931 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000931.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000932 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000932.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000933 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000933.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000934 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000934.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000935 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000935.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000936 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000936.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000937 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000937.
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000939 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,513 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000939.
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000940 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000940.
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000941 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000941.
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000942 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000942.
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000943 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000943.
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000944 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000944.
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000945 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000945.
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000946 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Returning
>>> excess container container_e22_1571837093169_78279_01_000946.
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     - Received
>>> new container: container_e22_1571837093169_78279_01_000947 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-25 09:55:51,514 INFO
>>> org.apache.flink.yarn.YarnResourceManager                     -
>>>
>>>
>>>
>>>
>>>
>>> *From:* Chan, Regina [Engineering]
>>> *Sent:* Wednesday, October 23, 2019 4:51 PM
>>> *To:* 'Till Rohrmann' <trohrm...@apache.org>; Yang Wang <
>>> danrtsey...@gmail.com>
>>> *Cc:* user@flink.apache.org
>>> *Subject:* RE: The RMClient's and YarnResourceManagers internal state
>>> about the number of pending container requests has diverged
>>>
>>>
>>>
>>> Yeah thanks for the responses. We’re in the process of testing 1.9.1
>>> after we found https://issues.apache.org/jira/browse/FLINK-12342
>>> as the cause of the original issue. FLINK-9455 explains why it didn't
>>> work in legacy mode.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Till Rohrmann <trohrm...@apache.org>
>>> *Sent:* Wednesday, October 23, 2019 5:32 AM
>>> *To:* Yang Wang <danrtsey...@gmail.com>
>>> *Cc:* Chan, Regina [Engineering] <regina.c...@ny.email.gs.com>;
>>> user@flink.apache.org
>>> *Subject:* Re: The RMClient's and YarnResourceManagers internal state
>>> about the number of pending container requests has diverged
>>>
>>>
>>>
>>> Hi Regina,
>>>
>>>
>>>
>>> When using the FLIP-6 mode, you can control how long it takes for an
>>> idle TaskManager to be released via resourcemanager.taskmanager-timeout.
>>> By default it is set to 30s.
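>>>
>>> For example, to keep idle TaskManagers around for five minutes you would
>>> put something like the following in flink-conf.yaml (the value is in
>>> milliseconds; just a sketch, pick whatever fits your workload):
>>>
>>> resourcemanager.taskmanager-timeout: 300000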
>>>
>>>
>>>
>>> In the Flink version you are using, 1.6.4, we do not support
>>> TaskManagers with multiple slots properly [1]. The consequence is that
>>> Flink will request too many containers if you are using FLIP-6 and
>>> configured your TaskManagers to be started with more than a single slot.
>>> With Flink >= 1.7.0 this issue has been fixed.
>>>
>>>
>>>
>>> For the problem with the legacy mode it seems that there is a bug in the
>>> YarnFlinkResourceManager where we decrement the number of pending container
>>> requests by 2 instead of 1 every time a container is allocated [2]. This
>>> could explain the difference.
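>>>
>>> Just to make the arithmetic concrete, here is a schematic sketch (invented
>>> class name and request count, not the actual Flink code) of how such a
>>> double decrement makes the two counters diverge to the values in your
>>> exception:
>>>
>>> // Schematic illustration only, not YarnFlinkResourceManager itself.
>>> public class PendingRequestDivergence {
>>>     public static void main(String[] args) {
>>>         int rmPending = 120;      // resource manager's internal counter
>>>         int clientPending = 120;  // AMRMClient's outstanding requests
>>>         for (int allocated = 0; allocated < 60; allocated++) {
>>>             rmPending -= 2;       // decremented by 2 per allocated container
>>>             clientPending -= 1;   // only one request is actually satisfied
>>>         }
>>>         // Prints client=60, rm=0, i.e. the numbers from the
>>>         // IllegalStateException further down in this thread.
>>>         System.out.println("client=" + clientPending + ", rm=" + rmPending);
>>>     }
>>> }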
>>>
>>>
>>>
>>> Since the Flink community no longer actively maintains Flink 1.6, I was
>>> wondering whether it would be possible for you to upgrade to a later
>>> version of Flink? I believe that your observed problems are fixed in a more
>>> recent version (1.9.1).
>>>
>>>
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-9455
>>>
>>> [2]
>>> https://github.com/apache/flink/blob/release-1.6.4/flink-yarn/src/main/java/org/apache/flink/yarn/YarnFlinkResourceManager.java#L457
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Till
>>>
>>>
>>>
>>> On Wed, Oct 23, 2019 at 10:37 AM Yang Wang <danrtsey...@gmail.com>
>>> wrote:
>>>
>>> Hi Chan,
>>>
>>>
>>>
>>> After FLIP-6, the Flink ResourceManager dynamically allocates resources
>>> from Yarn on demand.
>>>
>>> What's your Flink version? On the current code base, if the number of
>>> pending container requests in the resource manager is zero, then it will
>>> release all the excess containers. Could you please check the
>>> "Remaining pending container requests" in your JM logs?
>>>
>>>
>>>
>>> On the other hand, Flink should not allocate so many resources. Did you
>>> set `taskmanager.numberOfTaskSlots`?
>>>
>>> The default value is 1, and Flink will allocate containers based on your
>>> max parallelism.
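>>>
>>> (Roughly: with a job parallelism of 90 and the default of 1 slot per
>>> TaskManager, Flink would ask for about 90 containers, whereas 2 slots per
>>> TaskManager would need about 45.)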
>>>
>>>
>>>
>>>
>>>
>>> Best,
>>>
>>> Yang
>>>
>>>
>>>
>>> On Wed, Oct 23, 2019 at 12:40 AM Chan, Regina <regina.c...@gs.com> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> One of our Flink jobs has a lot of tiny Flink jobs (and some larger jobs)
>>> associated with it that then request and release resources as needed, as
>>> per the FLIP-6 mode. Internally we track how much parallelism we’ve used
>>> before submitting the new job so that we’re bounded by the expected top
>>> cap. What we found was that the job intermittently holds onto 20-40x what
>>> is expected, thereby eating into our cluster’s overall resources. It seems
>>> as if Flink isn’t releasing the resources back to Yarn quickly enough for
>>> these.
>>>
>>>
>>>
>>> As an immediate stopgap, I tried reverting to legacy mode, hoping that the
>>> resource utilization would then at least be constant as per the number of
>>> task managers + slots + memory allocated. However, we then ran into this
>>> issue. Why would the client’s pending container requests still be 60 when
>>> Yarn shows they’ve been allocated? What can we do here?
>>>
>>>
>>>
>>> org.apache.flink.runtime.akka.StoppingSupervisorWithoutLoggingActorKilledExceptionStrategy
>>> - Actor failed with exception. Stopping it now.
>>>
>>> java.lang.IllegalStateException: The RMClient's and YarnResourceManagers
>>> internal state about the number of pending container requests has diverged.
>>> Number client's pending container requests 60 != Number RM's pending
>>> container requests 0.
>>>
>>>             at
>>> org.apache.flink.util.Preconditions.checkState(Preconditions.java:217)
>>>
>>>             at
>>> org.apache.flink.yarn.YarnFlinkResourceManager.getPendingRequests(YarnFlinkResourceManager.java:520)
>>>
>>>             at
>>> org.apache.flink.yarn.YarnFlinkResourceManager.containersAllocated(YarnFlinkResourceManager.java:449)
>>>
>>>             at
>>> org.apache.flink.yarn.YarnFlinkResourceManager.handleMessage(YarnFlinkResourceManager.java:227)
>>>
>>>             at
>>> org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:104)
>>>
>>>             at
>>> org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:71)
>>>
>>>             at
>>> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>>>
>>>             at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>>
>>>             at
>>> akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>>>
>>>             at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>>
>>>             at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>>>
>>>             at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>>
>>>             at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>>
>>>             at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>>
>>>             at
>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>
>>>             at
>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>
>>>             at
>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>
>>>             at
>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>
>>>
>>>
>>> JobManager logs: (full logs also attached)
>>>
>>>
>>>
>>> 2019-10-22 11:36:52,733 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Received
>>> new container: container_e102_1569128826219_23941567_01_000002 - Remaining
>>> pending container requests: 118
>>>
>>> 2019-10-22 11:36:52,734 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Launching
>>> TaskManager in container ContainerInLaunch @ 1571758612734: Container:
>>> [ContainerId: container_e102_1569128826219_23941567_01_000002, NodeId:
>>> d49111-041.dc.gs.com:45454, NodeHttpAddress: d49111-041.dc.gs.com:8042,
>>> Resource: <memory:12288, vCores:2>, Priority: 0, Token: Token { kind:
>>> ContainerToken, service: 10.59.83.235:45454
>>> }, ] on host d49111-041.dc.gs.com
>>>
>>> 2019-10-22 11:36:52,736 INFO
>>> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  -
>>> Opening proxy : d49111-041.dc.gs.com:45454
>>>
>>> 2019-10-22 11:36:52,784 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Received
>>> new container: container_e102_1569128826219_23941567_01_000003 - Remaining
>>> pending container requests: 116
>>>
>>> 2019-10-22 11:36:52,784 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Launching
>>> TaskManager in container ContainerInLaunch @ 1571758612784: Container:
>>> [ContainerId: container_e102_1569128826219_23941567_01_000003, NodeId:
>>> d49111-162.dc.gs.com:45454, NodeHttpAddress: d49111-162.dc.gs.com:8042,
>>> Resource: <memory:12288, vCores:2>, Priority: 0, Token: Token { kind:
>>> ContainerToken, service: 10.59.72.254:45454
>>> }, ] on host d49111-162.dc.gs.com
>>>
>>> ….
>>>
>>> Received new container: container_e102_1569128826219_23941567_01_000066
>>> - Remaining pending container requests: 2
>>>
>>> 2019-10-22 11:36:53,409 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Launching
>>> TaskManager in container ContainerInLaunch @ 1571758613409: Container:
>>> [ContainerId: container_e102_1569128826219_23941567_01_000066, NodeId:
>>> d49111-275.dc.gs.com:45454, NodeHttpAddress: d49111-275.dc.gs.com:8042,
>>> Resource: <memory:12288, vCores:2>, Priority: 0, Token: Token { kind:
>>> ContainerToken, service: 10.50.199.239:45454
>>> }, ] on host d49111-275.dc.gs.com
>>>
>>> 2019-10-22 11:36:53,411 INFO
>>> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  -
>>> Opening proxy : d49111-275.dc.gs.com:45454
>>>
>>> 2019-10-22 11:36:53,418 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Received
>>> new container: container_e102_1569128826219_23941567_01_000067 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-22 11:36:53,418 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Launching
>>> TaskManager in container ContainerInLaunch @ 1571758613418: Container:
>>> [ContainerId: container_e102_1569128826219_23941567_01_000067, NodeId:
>>> d49111-409.dc.gs.com:45454, NodeHttpAddress: d49111-409.dc.gs.com:8042,
>>> Resource: <memory:12288, vCores:2>, Priority: 0, Token: Token { kind:
>>> ContainerToken, service: 10.59.40.203:45454
>>> }, ] on host d49111-409.dc.gs.com
>>>
>>> 2019-10-22 11:36:53,420 INFO
>>> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  -
>>> Opening proxy : d49111-409.dc.gs.com:45454
>>>
>>> 2019-10-22 11:36:53,430 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Received
>>> new container: container_e102_1569128826219_23941567_01_000070 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-22 11:36:53,430 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Launching
>>> TaskManager in container ContainerInLaunch @ 1571758613430: Container:
>>> [ContainerId: container_e102_1569128826219_23941567_01_000070, NodeId:
>>> d49111-167.dc.gs.com:45454, NodeHttpAddress: d49111-167.dc.gs.com:8042,
>>> Resource: <memory:12288, vCores:2>, Priority: 0, Token: Token { kind:
>>> ContainerToken, service: 10.51.138.251:45454
>>> }, ] on host d49111-167.dc.gs.com
>>>
>>> 2019-10-22 11:36:53,432 INFO
>>> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  -
>>> Opening proxy : d49111-167.dc.gs.com:45454
>>>
>>> 2019-10-22 11:36:53,439 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Received
>>> new container: container_e102_1569128826219_23941567_01_000072 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-22 11:36:53,440 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Launching
>>> TaskManager in container ContainerInLaunch @ 1571758613439: Container:
>>> [ContainerId: container_e102_1569128826219_23941567_01_000072, NodeId:
>>> d49111-436.dc.gs.com:45454, NodeHttpAddress: d49111-436.dc.gs.com:8042,
>>> Resource: <memory:12288, vCores:2>, Priority: 0, Token: Token { kind:
>>> ContainerToken, service: 10.59.235.176:45454
>>> }, ] on host d49111-436.dc.gs.com
>>>
>>> 2019-10-22 11:36:53,441 INFO
>>> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  -
>>> Opening proxy : d49111-436.dc.gs.com:45454
>>>
>>> 2019-10-22 11:36:53,449 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Received
>>> new container: container_e102_1569128826219_23941567_01_000073 - Remaining
>>> pending container requests: 0
>>>
>>> 2019-10-22 11:36:53,449 INFO
>>> org.apache.flink.yarn.YarnFlinkResourceManager                - Launching
>>> TaskManager in container ContainerInLaunch @ 1571758613449: Container:
>>> [ContainerId: container_e102_1569128826219_23941567_01_000073, NodeId:
>>> d49111-387.dc.gs.com:45454, NodeHttpAddress: d49111-387.dc.gs.com:8042,
>>> Resource: <memory:12288, vCores:2>, Priority: 0, Token: Token { kind:
>>> ContainerToken, service: 10.51.136.247:45454
>>> }, ] on host d49111-387.dc.gs.com
>>>
>>> …..
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Regina
>>>
>>>
>>
