I didn’t realize we were not chatting on the mailing list :)

I think it’s wrong because it implies that a full GC is triggered by reaching 
MaxDirectMemorySize.


> On Dec 16, 2019, at 11:03 PM, Xintong Song <tonysong...@gmail.com> wrote:
> 
> Glad that helped. I'm also posting this conversation to the public mailing 
> list, in case other people have similar questions.
> 
> And regarding the GC statement, I think the document is correct.
> - Flink Memory Manager guarantees that the amount of allocated managed memory 
> never exceeds the configured capacity, so managed memory allocation should 
> not trigger OOM.
> - When preallocation is enabled, managed memory segments are allocated and 
> pooled by the Flink Memory Manager regardless of whether tasks request them. 
> The segments are not deallocated until the cluster is shut down.
> - When preallocation is disabled, managed memory segments are allocated only 
> when tasks request them, and destroyed immediately when tasks return them 
> to the Memory Manager. However, what this statement is trying to say is that 
> the memory is not deallocated directly when a memory segment is destroyed; 
> it has to wait for a GC before it is truly released (see the sketch below).
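> 
> To illustrate that last point: with off-heap managed memory, segments are 
> backed by direct ByteBuffers, and the native memory behind a direct buffer 
> is only returned once the buffer object itself is garbage-collected. A 
> minimal, purely illustrative Java snippet (not Flink code):
> 
>     import java.nio.ByteBuffer;
> 
>     public class DirectRelease {
>         public static void main(String[] args) {
>             // Reserves 32MB of native (off-heap) memory.
>             ByteBuffer segment = ByteBuffer.allocateDirect(32 * 1024 * 1024);
>             // Dropping the reference "destroys" the segment from the
>             // application's point of view, but the native memory is still
>             // held by the unreachable buffer object.
>             segment = null;
>             // Only when the buffer object is collected does the JVM's
>             // cleaner free the underlying native memory.
>             System.gc();
>         }
>     }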
> 
> Thank you~
> Xintong Song
> 
> 
> On Tue, Dec 17, 2019 at 12:30 PM Ethan Li <ethanopensou...@gmail.com> wrote:
> Thank you very much Xintong! It’s much clearer to me now.
> 
> I am still on a standalone cluster setup. Before, I was using 350GB of 
> on-heap memory on a 378GB box and saw a lot of swap activity. Now I 
> understand that it’s because RocksDB didn’t have enough memory to use, so 
> the OS forced the JVM to swap. That explains why the cluster was unstable 
> and kept crashing.
> 
> Now that I have 150GB off-heap and 150GB on-heap, the cluster is more stable 
> than before. I thought it was because GC pressure was reduced now that there 
> is less heap memory. Now I understand it’s because there are 78GB of memory 
> available for RocksDB to use, 50GB more than before. That also explains why 
> I don’t see swapping anymore.
> 
> This makes sense to me now. I just have to set preallocation to false so 
> that the other 150GB of off-heap memory can be used by RocksDB, and then do 
> some tuning on these memory configs.
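> 
> For reference, the two settings being discussed map to these flink-conf.yaml 
> entries under the Flink 1.8 option keys (a sketch; the surrounding sizing 
> options are omitted):
> 
>     taskmanager.memory.off-heap: true
>     taskmanager.memory.preallocate: false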
> 
> 
> One thing I noticed is that in 
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#taskmanager-memory-preallocate
> 
>  If this configuration is set to false cleaning up of the allocated off-heap 
> memory happens only when the configured JVM parameter MaxDirectMemorySize is 
> reached by triggering a full GC
> 
> I think this statement is not correct. GC is not triggered by reaching 
> MaxDirectMemorySize. It will throw "java.lang.OutOfMemoryError: Direct 
> buffer memory" if MaxDirectMemorySize is reached.
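> 
> A quick way to observe this behavior (an illustrative snippet, not Flink 
> code; run with e.g. java -XX:MaxDirectMemorySize=64m DirectOom):
> 
>     import java.nio.ByteBuffer;
>     import java.util.ArrayList;
>     import java.util.List;
> 
>     public class DirectOom {
>         public static void main(String[] args) {
>             // Keep references alive so GC cannot reclaim any buffer.
>             List<ByteBuffer> buffers = new ArrayList<>();
>             while (true) {
>                 // Once the 64m limit is exhausted, this throws
>                 // java.lang.OutOfMemoryError: Direct buffer memory
>                 buffers.add(ByteBuffer.allocateDirect(16 * 1024 * 1024));
>             }
>         }
>     }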
> 
> Thank you again for your help!
> 
> Best,
> Ethan
> 
> 
>> On Dec 16, 2019, at 9:44 PM, Xintong Song <tonysong...@gmail.com> wrote:
>> 
>> Hi Ethan,
>> 
>> When you say "it's doing better than before", what was your setup before? 
>> Was it on-heap managed memory? With preallocation enabled or disabled? 
>> Also, what deployment mode (standalone, yarn, or local executor) do you run 
>> Flink in? It's hard to tell why the performance became better without 
>> knowing the information above.
>> 
>> Since you are using RocksDB and configure managed memory to off-heap, you 
>> should set pre-allocation to false. A streaming job with the RocksDB state 
>> backend does not use managed memory at all. Setting managed memory to 
>> off-heap only makes Flink launch the JVM with a smaller heap, leaving more 
>> space outside the JVM. Setting pre-allocation to false makes Flink allocate 
>> managed memory on demand, and since there is no demand, the managed memory 
>> is never allocated. Therefore, the memory space left outside the JVM can be 
>> fully leveraged by RocksDB.
>> 
>> Regarding related source code, I would recommend the following (a rough 
>> sketch of the sizing logic follows the list):
>> - MemoryManager - For how managed memory is allocated / used. Related to 
>> pre-allocation.
>> - ContaineredTaskManagerParameters - For how the JVM memory parameters are 
>> decided. Related to on-heap / off-heap managed memory.
>> - TaskManagerServices#fromConfiguration - For how different components are 
>> created, as well as how their memory sizes are decided. Also related to 
>> on-heap / off-heap managed memory.
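>> 
>> Conceptually, the sizing logic amounts to something like this (a simplified 
>> sketch, not Flink's actual code; names are invented for illustration):
>> 
>>     public class MemorySizingSketch {
>>         /** JVM heap size given total TM memory and the managed memory size. */
>>         static long heapSizeBytes(long totalBytes, long managedBytes,
>>                                   boolean offHeapManaged) {
>>             // With off-heap managed memory, the managed portion is carved
>>             // out of the heap and granted to direct memory instead.
>>             return offHeapManaged ? totalBytes - managedBytes : totalBytes;
>>         }
>> 
>>         public static void main(String[] args) {
>>             long total = 300L << 30;   // 300GB total, as in the setup above
>>             long managed = 150L << 30; // 150GB managed memory, off-heap
>>             System.out.println(
>>                 "-Xmx" + (heapSizeBytes(total, managed, true) >> 30) + "g");
>>         }
>>     }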
>> 
>> Thank you~
>> Xintong Song
>> 
>> 
>> On Tue, Dec 17, 2019 at 11:00 AM Ethan Li <ethanopensou...@gmail.com> wrote:
>> Thank you, Xintong and Vino, for taking the time to answer my question. I 
>> didn’t know managed memory is only used for batch jobs.
>> 
>> 
>> 
>> I tried setting Flink managed memory to off-heap (with preallocation set to 
>> true) and it’s doing better than before. That would not make sense if 
>> managed memory were not used, so I was confused. Then I found this doc: 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>> 
>> Configuring an off-heap state backend like RocksDB means either also setting 
>> managed memory to off-heap or adjusting the cutoff ratio, to dedicate less 
>> memory to the JVM heap.
>> 
>> 
>> We use RocksDB too, so I guess I was doing that correctly by accident. So 
>> the question here is: in this case, should we set preallocate to true or 
>> false?
>> 
>> If set to true, the TM will allocate memory off-heap during startup. Will 
>> this part of memory be used by RocksDB?
>> If set to false, how is this off-heap memory managed? Will the allocated 
>> memory ever be cleaned up and reused?
>> 
>> I’d really appreciate it if you or anyone from the community could share 
>> some ideas or point me to the code. I am reading the source code but 
>> haven’t gotten there yet.
>> 
>> Thank you very much!
>> 
>> Best,
>> Ethan
>> 
>> 
>> 
>>> On Dec 16, 2019, at 1:27 AM, Xintong Song <tonysong...@gmail.com> wrote:
>>> 
>>> Hi Ethan,
>>> 
>>> Currently, managed memory is only used for batch jobs (DataSet / Blink 
>>> SQL). Setting it to off-heap and enabling pre-allocation can improve 
>>> performance when managed memory is actually used. However, since you are 
>>> running streaming jobs, which "currently do not use the managed memory", I 
>>> would suggest setting managed memory to on-heap and disabling 
>>> pre-allocation. That way, Flink will not allocate managed memory segments 
>>> that are never actually used, and the corresponding memory remains 
>>> available for other JVM heap usage.
>>> 
>>> The above is for Flink 1.9 and earlier. In the upcoming Flink 1.10, we are 
>>> removing the pre-allocation of managed memory, making managed memory 
>>> always off-heap, and making the RocksDB state backend use managed memory. 
>>> This means the two config options you mentioned will no longer exist in 
>>> future releases. In case you're planning to migrate to the upcoming Flink 
>>> 1.10: if your streaming jobs use the RocksDB state backend, then hopefully 
>>> it's not necessary for you to change any configuration; but if your jobs 
>>> use the heap state backend, it would be better to configure the managed 
>>> memory size / fraction to 0 (see the example below), because otherwise the 
>>> corresponding memory cannot be used by any component.
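>>> 
>>> For example, with a heap state backend on Flink 1.10, that could look like 
>>> this in flink-conf.yaml (an illustrative sketch using the 1.10 option 
>>> keys):
>>> 
>>>     taskmanager.memory.managed.size: 0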
>>> 
>>> Thank you~
>>> Xintong Song
>>> 
>>> 
>>> On Sat, Dec 14, 2019 at 5:20 AM Ethan Li <ethanopensou...@gmail.com> wrote:
>>> Hi Community,
>>> 
>>> I have a question about the taskmanager.memory.preallocate config in the 
>>> doc 
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#taskmanager-memory-preallocate
>>> 
>>> We have a large-memory box, so as suggested there, we should use off-heap 
>>> memory for Flink managed memory. The doc then suggests setting 
>>> taskmanager.memory.preallocate to true. However:
>>> 
>>>  "For streaming setups is is highly recommended to set this value to false 
>>> as the core state backends currently do not use the managed memory."
>>> 
>>> 
>>> Our Flink setup is mainly for streaming jobs, so I think the above applies 
>>> to our case. Should I use off-heap memory with "preallocate" set to false? 
>>> What would be the impact of these configs?
>>> 
>>> 
>>> Thank you very much!
>>> 
>>> 
>>> Best,
>>> Ethan
>> 
> 
