[ https://issues.apache.org/jira/browse/FLINK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346558#comment-17346558 ]
Yun Tang commented on FLINK-21986:
----------------------------------

[~Feifan Wang], could you tell me which Docker image, parallelism, and memory-related configuration you used, and which steps you took when running flink-21986-regular-join-test-case? I could not reproduce the problem of memory continuing to grow after failover restore when running flink-21986-regular-join-test-case.

> taskmanager native memory not release timely after restart
> ----------------------------------------------------------
>
>                 Key: FLINK-21986
>                 URL: https://issues.apache.org/jira/browse/FLINK-21986
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 1.11.3, 1.12.1, 1.13.0
>         Environment: flink version: 1.12.1
> run: yarn session
> job type: mock source -> regular join
>
> checkpoint interval: 3m
> Taskmanager memory: 16G
>
>            Reporter: Feifan Wang
>            Assignee: Feifan Wang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.11.4, 1.13.0, 1.12.3
>
>         Attachments: 82544.svg, image-2021-03-25-15-53-44-214.png,
> image-2021-03-25-16-07-29-083.png, image-2021-03-26-11-46-06-828.png,
> image-2021-03-26-11-47-21-388.png
>
>
> I ran a regular join job with Flink 1.12.1 and found that taskmanager
> native memory is not released in a timely manner after a restart caused by
> exceeding the checkpoint tolerable failure threshold.
> *problem job information:*
> # The job first restarted because it exceeded the checkpoint tolerable
> failure threshold.
> # Then the taskmanagers were killed by YARN many times.
> # In this case, the TM heap is set to 7.68G, but the heap size of every TM
> stays under 4.2G.
> !image-2021-03-25-15-53-44-214.png|width=496,height=103!
> # The non-heap size increases after restart, but still stays under 160M.
>
> !https://km.sankuai.com/api/file/cdn/706284607/716474606?contentType=1&isNewContent=false&isNewContent=false|width=493,height=102!
> # Taskmanager process memory increases by 3-4G after restart (this figure
> shows one of the taskmanagers).
> !image-2021-03-25-16-07-29-083.png|width=493,height=107!
>
> *my guess:*
> The [RocksDB
> wiki|https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#memory-management]
> mentions: Many of the Java Objects used in the RocksJava API will be backed
> by C++ objects for which the Java Objects have ownership. As C++ has no
> notion of automatic garbage collection for its heap in the way that Java
> does, we must explicitly free the memory used by the C++ objects when we are
> finished with them.
> So, is it possible that RocksDBStateBackend does not call
> AbstractNativeReference#close() to release the memory used by RocksDB C++
> objects?
> *I made a change:*
> Actively call System.gc() and System.runFinalization() every minute.
> *And ran this test again:*
> # Taskmanager process memory shows no obvious increase.
> !image-2021-03-26-11-46-06-828.png|width=495,height=93!
> # The job ran for several days and restarted many times, but no taskmanager
> was killed by YARN as before.
>
> *Summary:*
> # First, some native memory cannot be released in a timely manner after
> restart in this situation.
> # I guess it may be RocksDB C++ objects, but I have not checked this in the
> source code of RocksDBStateBackend.
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
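
For reference, the workaround described in the quoted issue (actively calling System.gc() and System.runFinalization() every minute) could be sketched roughly as below. This is only a minimal illustration under assumptions: the class name PeriodicGcHint and its methods are hypothetical and not part of Flink or of the reporter's actual change, and where such a helper would be started inside a taskmanager is left open. Note also that System.gc() is only a hint to the JVM, and that System.runFinalization() is deprecated in recent JDKs, so this is a diagnostic aid rather than a fix.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of Flink): periodically asks the JVM to collect and finalize. */
final class PeriodicGcHint implements AutoCloseable {

    // Single daemon thread, so this helper never keeps the JVM alive on its own.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "periodic-gc-hint");
                t.setDaemon(true);
                return t;
            });

    private final AtomicLong completedRuns = new AtomicLong();

    /** One round of the workaround: request a GC, then run pending finalizers. */
    @SuppressWarnings("removal") // runFinalization() is deprecated for removal in newer JDKs
    void runOnce() {
        System.gc();              // a hint only; the JVM may ignore it
        System.runFinalization(); // gives pending finalizers a chance to run
        completedRuns.incrementAndGet();
    }

    /** Schedules runOnce() repeatedly, with the given delay between rounds. */
    void start(long interval, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(this::runOnce, interval, interval, unit);
    }

    long completedRuns() {
        return completedRuns.get();
    }

    boolean isShutdown() {
        return scheduler.isShutdown();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        try (PeriodicGcHint hint = new PeriodicGcHint()) {
            hint.start(1, TimeUnit.MINUTES); // "every minute", as in the issue
            hint.runOnce();                  // one immediate round for demonstration
            System.out.println("completed runs: " + hint.completedRuns());
        }
    }
}
```

If process memory stops growing with such a helper running, that would be consistent with the guess in the issue that native memory is only released once the owning Java objects are collected, but it does not by itself show which objects are involved.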