Hi community,
        I raised this issue about three weeks ago. After several weeks of
investigation, I found the root cause and explained it in the issue
comments.
        I have also opened a PR to fix the problem. (I'm sorry, I did not know
that I should wait until the issue was assigned to me before opening a PR. I
will pay attention to this next time.)
        Could a committer of the relevant module please check this issue,
assign it to me, and review the PR?


        Issue URL: https://issues.apache.org/jira/browse/FLINK-21986
        PR URL: https://github.com/apache/flink/pull/15619


Best wishes,
Feifan Wang


——————————————
Name: Feifan Wang
Email: zoltar9...@163.com


On 03/26/2021 12:00,Feifan Wang (Jira)<j...@apache.org> wrote:
Feifan Wang created FLINK-21986:
-----------------------------------

Summary: taskmanager native memory not release timely after restart
Key: FLINK-21986
URL: https://issues.apache.org/jira/browse/FLINK-21986
Project: Flink
Issue Type: Bug
Components: Runtime / State Backends
Affects Versions: 1.12.1
Environment: Flink version: 1.12.1
run mode: YARN session
job type: mock source -> regular join
 
checkpoint interval: 3m
Taskmanager memory : 16G
 
Reporter: Feifan Wang
Attachments: image-2021-03-25-15-53-44-214.png, 
image-2021-03-25-16-07-29-083.png, image-2021-03-26-11-46-06-828.png, 
image-2021-03-26-11-47-21-388.png

I ran a regular join job on Flink 1.12.1 and found that TaskManager native
memory is not released in a timely manner after a restart caused by exceeding
the tolerable checkpoint failure threshold.

*Problem job information:*
# The job first restarted because it exceeded the tolerable checkpoint failure
threshold.
# After that, TaskManagers were killed by YARN many times.
# In this case the TM heap is set to 7.68G, but heap usage on every TM stays
below 4.2G.
!image-2021-03-25-15-53-44-214.png|width=496,height=103!
# Non-heap size increased after the restart, but stayed below 160M.
!https://km.sankuai.com/api/file/cdn/706284607/716474606?contentType=1&isNewContent=false&isNewContent=false|width=493,height=102!
# TaskManager process memory increased by 3-4G after the restart (this figure
shows one of the TaskManagers).
!image-2021-03-25-16-07-29-083.png|width=493,height=107!

*My guess:*

The [RocksDB 
wiki|https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#memory-management]
 states: "Many of the Java Objects used in the RocksJava API will be backed 
by C++ objects for which the Java Objects have ownership. As C++ has no notion 
of automatic garbage collection for its heap in the way that Java does, we must 
explicitly free the memory used by the C++ objects when we are finished with 
them."

So is it possible that RocksDBStateBackend does not call 
AbstractNativeReference#close() to release the memory used by RocksDB C++ 
objects?
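
To make the suspected failure mode concrete, here is a minimal sketch in plain Java. It does not use RocksJava: NativeHandle is a hypothetical stand-in for an object that owns C++ memory, and LIVE_BYTES is a hypothetical counter that models native memory in use. It only illustrates the difference between closing such an object explicitly and dropping the reference without closing it.

```java
import java.util.concurrent.atomic.AtomicLong;

final class NativeHandleDemo {

    /** Hypothetical stand-in for a Java object that owns off-heap (C++) memory. */
    static final class NativeHandle implements AutoCloseable {
        /** Models the native memory currently held by all live handles. */
        static final AtomicLong LIVE_BYTES = new AtomicLong();

        private final long size;
        private boolean closed;

        NativeHandle(long size) {
            this.size = size;
            LIVE_BYTES.addAndGet(size); // models the native allocation
        }

        @Override
        public synchronized void close() {
            if (!closed) {              // closing twice must not free twice
                closed = true;
                LIVE_BYTES.addAndGet(-size);
            }
        }
    }

    static final long SIZE = 64L << 20; // 64 MiB, arbitrary illustration value

    /** Deterministic release: the memory is returned when the try block exits. */
    static long useAndClose() {
        try (NativeHandle handle = new NativeHandle(SIZE)) {
            return NativeHandle.LIVE_BYTES.get(); // usage while the handle is open
        }
    }

    /** No close(): the modelled native memory stays allocated. */
    static void useAndForget() {
        new NativeHandle(SIZE);
    }

    public static void main(String[] args) {
        long whileOpen = useAndClose();
        System.out.println("while open:   " + whileOpen);
        System.out.println("after close:  " + NativeHandle.LIVE_BYTES.get());
        useAndForget();
        System.out.println("after forget: " + NativeHandle.LIVE_BYTES.get());
    }
}
```

In the second case the Java object itself is tiny, so the JVM heap shows almost no pressure while the process memory keeps the full allocation. That would be consistent with the heap and non-heap figures above staying low, but it is only an illustration, not evidence about RocksDBStateBackend.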

*I made a change:*

        Explicitly call System.gc() and System.runFinalization() every minute.
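
One way such a periodic call could be wired up is sketched below. This is not necessarily the exact code used in the test: the class name PeriodicGc, the thread name, and the RUNS counter are assumptions for illustration. It is a diagnostic aid, not a fix. It can only help if the leaked objects release their native memory when they are collected or finalized.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PeriodicGc {

    /** Number of completed GC rounds, kept only so the behaviour can be observed. */
    static final AtomicInteger RUNS = new AtomicInteger();

    /** System.runFinalization() is deprecated on recent JDKs but still available. */
    @SuppressWarnings({"deprecation", "removal"})
    static void forceGc() {
        System.gc();              // request a full collection
        System.runFinalization(); // run finalizers of objects pending finalization
        RUNS.incrementAndGet();
    }

    /** Starts a daemon thread that calls forceGc() once per period. */
    static ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread thread = new Thread(runnable, "periodic-gc");
                    thread.setDaemon(true); // must not keep the JVM alive
                    return thread;
                });
        scheduler.scheduleAtFixedRate(PeriodicGc::forceGc, period, period, unit);
        return scheduler;
    }
}
```

Calling PeriodicGc.start(1, TimeUnit.MINUTES) once at process startup would match the one-minute interval described above; shutdownNow() on the returned scheduler stops it.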

*Then I ran the test again:*
# TaskManager process memory showed no obvious increase.
!image-2021-03-26-11-46-06-828.png|width=495,height=93!
# The job ran for several days and restarted many times, but no TaskManager was
killed by YARN as before.



*Summary:*
# First, some native memory is not released in a timely manner after a restart
in this situation.
# I guess it may be RocksDB C++ objects, but I have not yet verified this in the
RocksDBStateBackend source code.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
