Glad that helped. I'm also posting this conversation to the public mailing list, in case other people have similar questions.

And regarding the GC statement, I think the document is correct.

- Flink's Memory Manager guarantees that the amount of allocated managed memory never exceeds the configured capacity, so managed memory allocation should not trigger OOM.
- When preallocation is enabled, managed memory segments are allocated and pooled by the Memory Manager regardless of whether any tasks are requesting them. The segments are not deallocated until the cluster is shut down.
- When preallocation is disabled, managed memory segments are allocated only when tasks request them, and destroyed immediately when tasks return them to the Memory Manager. However, what the statement is trying to say is that the memory is not deallocated directly when a segment is destroyed; it is only truly released at the next GC.
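To make that last point concrete, here is a minimal standalone demo (plain JDK code, not Flink's MemoryManager; class name is made up). Run it with -XX:MaxDirectMemorySize=64m:

    import java.nio.ByteBuffer;

    // Run with: java -XX:MaxDirectMemorySize=64m DirectMemoryGcDemo
    public class DirectMemoryGcDemo {
        public static void main(String[] args) {
            // Reserve 48 MB of the 64 MB direct-memory budget.
            ByteBuffer first = ByteBuffer.allocateDirect(48 * 1024 * 1024);

            // Dropping the reference "destroys" the segment from the
            // application's point of view, but the native memory stays
            // reserved until GC collects the small Java wrapper object.
            first = null;

            // This second allocation does not fit in the 64 MB budget
            // until the first buffer's native memory has been released,
            // which only happens as part of garbage collection.
            ByteBuffer second = ByteBuffer.allocateDirect(48 * 1024 * 1024);
            System.out.println("second allocation ok: " + second.capacity());
        }
    }

On HotSpot the second allocation succeeds because the failed reservation itself prompts a collection; the point is simply that the native memory behind the "destroyed" buffer comes back only via GC.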
Thank you~
Xintong Song

On Tue, Dec 17, 2019 at 12:30 PM Ethan Li <ethanopensou...@gmail.com> wrote:

> Thank you very much Xintong! It's much clearer to me now.
>
> I am still on a standalone cluster setup. Before, I was using 350GB of on-heap memory on a 378GB box. I saw a lot of swap activity. Now I understand that it's because RocksDB didn't have enough memory to use, so the OS forced the JVM to swap. That explains why the cluster was not stable and kept crashing.
>
> Now that I put 150GB off-heap and 150GB on-heap, the cluster is more stable than before. I thought it was because GC was reduced, since we now have less heap memory. Now I understand it's because I have 78GB of memory available for RocksDB to use, 50GB more than before. And it explains why I don't see swaps anymore.
>
> This makes sense to me now. I just have to set preallocation to false to use the other 150GB of off-heap memory for RocksDB, and do some tuning on these memory configs.
>
> One thing I noticed is that in https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#taskmanager-memory-preallocate
>
>     "If this configuration is set to false cleaning up of the allocated off-heap memory happens only when the configured JVM parameter MaxDirectMemorySize is reached by triggering a full GC"
>
> I think this statement is not correct. GC is not triggered by reaching MaxDirectMemorySize. It will throw "java.lang.OutOfMemoryError: Direct buffer memory" if MaxDirectMemorySize is reached.
>
> Thank you again for your help!
>
> Best,
> Ethan
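(Side note for readers of the archive: the OutOfMemoryError Ethan describes is easy to reproduce with plain JDK code. A minimal sketch, not Flink code; run with -XX:MaxDirectMemorySize=64m:)

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    // Run with: java -XX:MaxDirectMemorySize=64m DirectMemoryOomDemo
    public class DirectMemoryOomDemo {
        public static void main(String[] args) {
            List<ByteBuffer> live = new ArrayList<>();
            try {
                // Strong references keep every buffer alive, so GC cannot
                // release any native memory. Once the 64 MB cap is used
                // up, the next reservation fails.
                while (true) {
                    live.add(ByteBuffer.allocateDirect(8 * 1024 * 1024));
                }
            } catch (OutOfMemoryError e) {
                // Prints: java.lang.OutOfMemoryError: Direct buffer memory
                System.out.println(e + " after " + live.size() + " buffers");
            }
        }
    }

Whether GC can rescue an allocation depends on there being unreferenced buffers to collect; here the list keeps every buffer alive, so the error is guaranteed.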
> On Dec 16, 2019, at 9:44 PM, Xintong Song <tonysong...@gmail.com> wrote:
>
> Hi Ethan,
>
> When you say "it's doing better than before", what was your setup before? Was it on-heap managed memory, with preallocation enabled or disabled? Also, which deployment (standalone, yarn, or local executor) do you run Flink on? It's hard to tell why the performance became better without knowing the information above.
>
> Since you are using RocksDB and have configured managed memory to off-heap, you should set preallocation to false. Streaming jobs with the RocksDB state backend do not use managed memory at all. Setting managed memory to off-heap only makes Flink launch the JVM with a smaller heap, leaving more space outside the JVM. Setting preallocation to false makes Flink allocate managed memory on demand, and since there is no demand, the managed memory will not be allocated. Therefore, the memory space left outside the JVM can be fully leveraged by RocksDB.
>
> Regarding related source code, I would recommend the following:
> - MemoryManager - For how managed memory is allocated / used. Related to preallocation.
> - ContaineredTaskManagerParameters - For how the JVM memory parameters are decided. Related to on-heap / off-heap managed memory.
> - TaskManagerServices#fromConfiguration - For how the different components are created, as well as how their memory sizes are decided. Also related to on-heap / off-heap managed memory.
>
> Thank you~
> Xintong Song
>
> On Tue, Dec 17, 2019 at 11:00 AM Ethan Li <ethanopensou...@gmail.com> wrote:
>
>> Thank you, Xintong and Vino, for taking the time to answer my question. I didn't know managed memory is only for batch jobs.
>>
>> I tried setting Flink managed memory to off-heap (with preallocation set to true) and it's doing better than before. That would not make sense if managed memory is not used, so I was confused. Then I found this doc: https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>>
>>     "Configuring an off-heap state backend like RocksDB means either also setting managed memory to off-heap or adjusting the cutoff ratio, to dedicate less memory to the JVM heap."
>>
>> We use RocksDB too, so I guess I was doing that correctly by accident. So the question here is: in this case, should we set preallocate to true or false?
>>
>> If set to true, the TM will allocate off-heap memory during startup. Will this part of memory be used by RocksDB?
>> If set to false, how is this off-heap memory managed? Will the allocated memory ever be cleaned up and reused?
>>
>> I'd really appreciate it if you or anyone from the community could share some ideas or point me to the code. I am reading the source code but haven't gotten there yet.
>>
>> Thank you very much!
>>
>> Best,
>> Ethan
>>
>> On Dec 16, 2019, at 1:27 AM, Xintong Song <tonysong...@gmail.com> wrote:
>>
>> Hi Ethan,
>>
>> Currently, managed memory is only used for batch jobs (DataSet / Blink SQL). Setting it to off-heap and enabling preallocation can improve the performance of jobs that use managed memory. However, since you are running streaming jobs, which currently do not use managed memory, I would suggest setting managed memory to on-heap and disabling preallocation. In this way, Flink will not allocate any managed memory segments that are not actually used, and the corresponding memory can still be used for other JVM heap purposes.
>>
>> The above is for Flink 1.9 and earlier. In the upcoming Flink 1.10, we are removing the preallocation of managed memory, making managed memory always off-heap, and making the RocksDB state backend use managed memory. This means the two config options you mentioned will no longer exist in future releases. In case you're planning to migrate to the upcoming Flink 1.10: if your streaming jobs use the RocksDB state backend, then hopefully it's not necessary for you to change any configuration; but if your jobs use the heap state backend, it would be better to configure the managed memory size / fraction to 0, because otherwise the corresponding memory cannot be used by any component.
>>
>> Thank you~
>> Xintong Song
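(For readers on Flink 1.10 and later: the advice above about the heap state backend translates to something like the sketch below. The class name is made up, and the option key is the FLIP-49 / 1.10 name "taskmanager.memory.managed.fraction"; please verify it against your release's documentation before relying on it.)

    import org.apache.flink.configuration.Configuration;

    // Sketch only: FLIP-49 / Flink 1.10 option names; verify against
    // the docs of your release.
    public class ManagedMemoryConfigSketch {
        public static Configuration forHeapStateBackend() {
            Configuration conf = new Configuration();
            // With a heap state backend, managed memory would sit unused,
            // so hand it back to the JVM heap by configuring a 0 fraction.
            conf.setString("taskmanager.memory.managed.fraction", "0");
            return conf;
        }
    }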
>> On Sat, Dec 14, 2019 at 5:20 AM Ethan Li <ethanopensou...@gmail.com> wrote:
>>
>>> Hi Community,
>>>
>>> I have a question about the taskmanager.memory.preallocate config in the doc: https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#taskmanager-memory-preallocate
>>>
>>> We have large-memory boxes, so as the doc suggests, we should use off-heap memory for Flink managed memory. The doc then suggests setting taskmanager.memory.preallocate to true. However:
>>>
>>>     "For streaming setups it is highly recommended to set this value to false as the core state backends currently do not use the managed memory."
>>>
>>> Our Flink setup is mainly for streaming jobs, so I think the above applies to our case. Should I use off-heap with "preallocate" set to false? What would be the impact of these configs?
>>>
>>> Thank you very much!
>>>
>>> Best,
>>> Ethan
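(Closing note for the archive: on Flink 1.8, the combination Ethan settled on upthread, off-heap managed memory with preallocation disabled, corresponds to the keys below. The class name is made up, and the snippet uses the Configuration API only for brevity; on a standalone cluster these keys would normally be set in flink-conf.yaml.)

    import org.apache.flink.configuration.Configuration;

    // Sketch of the Flink 1.8 keys discussed in this thread; on a
    // standalone cluster these would live in flink-conf.yaml.
    public class PreallocateConfigSketch {
        public static Configuration forStreamingWithRocksDb() {
            Configuration conf = new Configuration();
            // Keep managed memory off-heap so the JVM heap stays small
            // and memory outside the JVM remains available to RocksDB ...
            conf.setString("taskmanager.memory.off-heap", "true");
            // ... and disable preallocation so the (unused) managed
            // memory is never actually allocated by the MemoryManager.
            conf.setString("taskmanager.memory.preallocate", "false");
            return conf;
        }
    }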