Re: RocksDB: Spike in Memory Usage Post Restart

2021-10-06 Thread Kevin Lam
Hi Fabian, Yes, I can tell you a bit more about the job we are seeing the problem with. I'll simplify things a bit, but this captures the essence: 1. Input datastreams are from a few Kafka sources that we intend to join. 2. We wrap the datastreams we want to join into a common container class and k

Re: RocksDB: Spike in Memory Usage Post Restart

2021-10-06 Thread Fabian Paul
Hi Kevin, Since you are seeing the problem across multiple Flink versions, and with both the default and a custom RocksDB configuration, it might be related to something else. Many different components can allocate direct memory, e.g. some filesystem implementations, the connectors or some user grpc
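
One way to check whether JVM direct buffers are part of the growth is to read the JVM's "direct" buffer pool through the standard `BufferPoolMXBean`. The following is only a minimal sketch: the class name `DirectMemoryProbe` and the 1 MiB test allocation are illustrative and not from the thread. Note that this pool only covers NIO direct buffers; memory that RocksDB allocates natively in C++ does not show up here, so a flat reading would point back towards native allocations.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper (not part of Flink): reports bytes held by NIO direct buffers.
public class DirectMemoryProbe {

    // Returns the bytes currently used by the JVM's "direct" buffer pool,
    // or -1 if the JVM does not expose such a pool.
    static long directBytesUsed() {
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        // Allocate 1 MiB off-heap; keep the reference so it is not collected.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
        long after = directBytesUsed();
        System.out.println("direct bytes before: " + before);
        System.out.println("direct bytes after:  " + after);
        System.out.println("buffer capacity:     " + buffer.capacity());
    }
}
```

Logging `directBytesUsed()` periodically from inside the task manager JVM (for example from a user function) would show whether direct buffers grow after a restore.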

Re: RocksDB: Spike in Memory Usage Post Restart

2021-10-06 Thread Kevin Lam
Hi Fabian, Thanks for collecting feedback. Here are the answers to your questions: 1. Yes, we enabled incremental checkpoints for our job by setting `state.backend.incremental` to true. As for whether the checkpoint we recover from is incremental or not, I'm not sure how to determine that. It's wha

Re: RocksDB: Spike in Memory Usage Post Restart

2021-10-06 Thread Fabian Paul
Hi Kevin, Sorry for the late reply. I collected some feedback from other folks and have two more questions. 1. Did you enable incremental checkpoints for your job and is the checkpoint you recover from incremental? 2. I saw in your configuration that you set `state.backend.rocksdb.block.cach

Re: RocksDB: Spike in Memory Usage Post Restart

2021-10-05 Thread Kevin Lam
I was reading a bit about RocksDB, and it seems the Java version is somewhat particular about how objects should be closed to ensure all resources are released: https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#me
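
The cleanup concern above comes from Java objects that own native (C++) memory, which the garbage collector does not account for, so they have to be closed explicitly. A minimal sketch of the usual try-with-resources pattern follows. It deliberately does not use the RocksJava API: `NativeHandle` is a hypothetical stand-in for a native-backed object, and the counter stands in for off-heap memory.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class NativeHandleSketch {

    // Number of handles currently open; a stand-in for native memory in use.
    static final AtomicInteger OPEN_HANDLES = new AtomicInteger();

    // Hypothetical native-backed object (not a RocksJava class).
    static final class NativeHandle implements AutoCloseable {
        private boolean closed;

        NativeHandle() {
            OPEN_HANDLES.incrementAndGet();
        }

        @Override
        public void close() {
            // Idempotent: release the "native" resource only once.
            if (!closed) {
                closed = true;
                OPEN_HANDLES.decrementAndGet();
            }
        }
    }

    // Opens two handles, reports how many are open while in use,
    // and lets try-with-resources close both in reverse order.
    static int useAndRelease() {
        try (NativeHandle options = new NativeHandle();
             NativeHandle db = new NativeHandle()) {
            return OPEN_HANDLES.get();
        }
    }

    public static void main(String[] args) {
        System.out.println("open during use: " + useAndRelease());
        System.out.println("open after use:  " + OPEN_HANDLES.get());
    }
}
```

If a handle is created outside a try-with-resources block and never closed, the counter stays above zero, which is the analogue of native memory that is never freed.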

Re: RocksDB: Spike in Memory Usage Post Restart

2021-10-04 Thread Kevin Lam
We tried with 1.14.0; unfortunately, we still run into the issue. Any thoughts or suggestions? On Mon, Oct 4, 2021 at 9:09 AM Kevin Lam wrote: > Hi Fabian, > > We're using our own image built from the official Flink docker image, so > we should have the code to use jemalloc in the docker entrypoi

Re: RocksDB: Spike in Memory Usage Post Restart

2021-10-04 Thread Kevin Lam
Hi Fabian, We're using our own image built from the official Flink docker image, so we should have the code to use jemalloc in the docker entrypoint. I'm going to give 1.14 a try and will let you know how it goes. On Mon, Oct 4, 2021 at 8:29 AM Fabian Paul wrote: > Hi Kevin, > > We bumped the

Re: RocksDB: Spike in Memory Usage Post Restart

2021-10-04 Thread Fabian Paul
Hi Kevin, We bumped the RocksDB version with Flink 1.14, which we expected to improve memory control [1]. In the past we also saw problems with the allocator used by the OS. We switched to jemalloc within our docker images, which handles memory fragmentation better [2]. Are you using the offi

Re: RocksDB: Spike in Memory Usage Post Restart

2021-10-01 Thread Kevin Lam
Hi Fabian, Thanks for your response. Sure, let me tell you a bit more about the job. - Flink version 1.13.1 (I also tried 1.13.2 because I saw FLINK-22886, but this didn't help) - We're running on Kubernetes in an application cluste

Re: RocksDB: Spike in Memory Usage Post Restart

2021-10-01 Thread Fabian Paul
Hi Kevin, You are right, RocksDB is probably responsible for the memory consumption you are noticing. We have definitely seen similar issues in the past, and with the latest Flink version 1.14 we tried to restrict the RocksDB memory consumption even more to make it easier to control. Can you t

RocksDB: Spike in Memory Usage Post Restart

2021-09-30 Thread Kevin Lam
Hi all, We're debugging an issue with OOMs that occurs on our jobs shortly after a restore from checkpoint. Our application is running on Kubernetes and uses RocksDB as its state backend. We reproduced the issue on a small cluster of 2 task managers. If we killed a single task manager, we notice