I was reading a bit about RocksDB, and it seems the Java version is particular about how it must be closed to ensure all native resources are released:
https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#memory-management
- "Many of the Java Objects used in the RocksJava API will be backed by C++ objects for which the Java Objects have ownership. As C++ has no notion of automatic garbage collection for its heap in the way that Java does, we must explicitly free the memory used by the C++ objects when we are finished with them."

Column families also have a specific close procedure:
https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#opening-a-database-with-column-families
- "It is important to note that when working with Column Families in RocksJava, there is a very specific order of destruction that must be obeyed for the database to correctly free all resources and shutdown."

When a running job fails and a running TaskManager restores from a checkpoint, is the old embedded RocksDB instance being cleaned up properly? I wasn't really sure where to look in the Flink source code to verify this.

On Mon, Oct 4, 2021 at 4:56 PM Kevin Lam <kevin....@shopify.com> wrote:

> We tried with 1.14.0, unfortunately we still run into the issue. Any
> thoughts or suggestions?
>
> On Mon, Oct 4, 2021 at 9:09 AM Kevin Lam <kevin....@shopify.com> wrote:
>
>> Hi Fabian,
>>
>> We're using our own image built from the official Flink docker image, so
>> we should have the code to use jemalloc in the docker entrypoint.
>>
>> I'm going to give 1.14 a try and will let you know how it goes.
>>
>> On Mon, Oct 4, 2021 at 8:29 AM Fabian Paul <fabianp...@ververica.com>
>> wrote:
>>
>>> Hi Kevin,
>>>
>>> We bumped the RocksDb version with Flink 1.14 which we thought increases
>>> the memory control [1]. In the past we also saw problems with the allocator
>>> used of the OS. We switched to use jemalloc within our docker images which
>>> has a better memory fragmentation [2]. Are you using the official Flink
>>> docker image or did you build your own?
>>>
>>> I am also pulling in yun tang who is more familiar with Flink's state
>>> backend. Maybe he has an immediate idea about your problem.
>>>
>>> Best,
>>> Fabian
>>>
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-14482
>>> [2]
>>> https://lists.apache.org/thread.html/r596a19f8cf7278bcf9e30c3060cf00562677d4be072050444a5caf99%40%3Cdev.flink.apache.org%3E
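
P.S. To make the ordering question concrete, here is a minimal sketch of the destruction order as I understand it from the RocksJava wiki: column family handles first, then the database and its options, then the column family options. It deliberately uses a hypothetical stand-in class (NativeStandIn, my own name) instead of the real RocksJava classes, so it only demonstrates the close order that nested try-with-resources blocks produce. It says nothing about what Flink actually does on restore.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Java object that owns native (C++) memory.
// This is NOT the RocksJava API; it only records the order of close() calls.
final class NativeStandIn implements AutoCloseable {
    private final String name;
    private final List<String> closeLog;

    NativeStandIn(String name, List<String> closeLog) {
        this.name = name;
        this.closeLog = closeLog;
    }

    @Override
    public void close() {
        closeLog.add(name);
    }
}

final class CloseOrderSketch {
    // Returns the names of the stand-ins in the order they were closed.
    static List<String> run() {
        List<String> closeLog = new ArrayList<>();
        // Outer block: the column family options are declared first, so they are released last.
        try (NativeStandIn cfOptions = new NativeStandIn("cfOptions", closeLog)) {
            List<NativeStandIn> handles = new ArrayList<>();
            // try-with-resources closes in reverse declaration order: db, then dbOptions.
            try (NativeStandIn dbOptions = new NativeStandIn("dbOptions", closeLog);
                 NativeStandIn db = new NativeStandIn("db", closeLog)) {
                handles.add(new NativeStandIn("handle:default", closeLog));
                handles.add(new NativeStandIn("handle:my-cf", closeLog));
                try {
                    // ... reads and writes would happen here ...
                } finally {
                    // Close every column family handle before the db itself is closed.
                    for (NativeStandIn handle : handles) {
                        handle.close();
                    }
                }
            }
        }
        return closeLog;
    }

    public static void main(String[] args) {
        // prints [handle:default, handle:my-cf, db, dbOptions, cfOptions]
        System.out.println(run());
    }
}
```

If Flink's RocksDB state backend skips or reorders any of these steps when a task is cancelled and restored, that might explain native memory that is never returned, but I have not verified this against the Flink source.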