Thank you Vino for the information.

Best,
Ethan
> On Dec 17, 2019, at 8:29 PM, vino yang <yanghua1...@gmail.com> wrote:
>
> Hi Ethan,
>
> Sharing two things:
>
> I have found that the "taskmanager.memory.preallocate" config option has been removed in the master codebase.
> After searching the git history, I found that the description of "taskmanager.memory.preallocate" was written by @Chesnay Schepler <ches...@apache.org> (from the 1.8 branch). So maybe he can give more context or information. Correct me if I am wrong.
>
> Best,
> Vino.
>
> Ethan Li <ethanopensou...@gmail.com> wrote on Wednesday, December 18, 2019 at 10:07 AM:
>
> I didn’t realize we were not chatting in the mailing list :)
>
> I think it’s wrong because it kind of says a full GC is triggered by reaching MaxDirectMemorySize.
>
>> On Dec 16, 2019, at 11:03 PM, Xintong Song <tonysong...@gmail.com> wrote:
>>
>> Glad that helped. I'm also posting this conversation to the public mailing list, in case other people have similar questions.
>>
>> And regarding the GC statement, I think the document is correct.
>> - Flink's Memory Manager guarantees that the amount of allocated managed memory never exceeds the configured capacity, so managed memory allocation should not trigger OOM.
>> - When preallocation is enabled, managed memory segments are allocated and pooled by the Flink Memory Manager, regardless of whether any tasks are requesting them. The segments are not deallocated until the cluster is shut down.
>> - When preallocation is disabled, managed memory segments are allocated only when tasks request them, and destroyed immediately when tasks return them to the Memory Manager. However, what this statement is trying to say is that the memory is not deallocated directly when the memory segment is destroyed, but has to wait for a GC to be truly released.
>>
>> Thank you~
>> Xintong Song
>>
>> On Tue, Dec 17, 2019 at 12:30 PM Ethan Li <ethanopensou...@gmail.com> wrote:
>> Thank you very much Xintong! It’s much clearer to me now.
>>
>> I am still on a standalone cluster setup. Before, I was using 350GB of on-heap memory on a 378GB box. I saw a lot of swap activity. Now I understand that it’s because RocksDB didn’t have enough memory to use, so the OS forced the JVM to swap. That explains why the cluster was not stable and kept crashing.
>>
>> Now that I put 150GB off-heap and 150GB on-heap, the cluster is more stable than before. I thought it was because GC was reduced, since we now have less heap memory. Now I understand it’s because I have 78GB of memory available for RocksDB to use, 50GB more than before. And it explains why I don’t see swapping anymore.
>>
>> This makes sense to me now. I just have to set preallocation to false so the other 150GB of off-heap memory can be used by RocksDB, and do some tuning on these memory configs.
>>
>> One thing I noticed is that in https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#taskmanager-memory-preallocate
>>
>> "If this configuration is set to false cleaning up of the allocated off-heap memory happens only when the configured JVM parameter MaxDirectMemorySize is reached by triggering a full GC"
>>
>> I think this statement is not correct. GC is not triggered by reaching MaxDirectMemorySize. It will throw "java.lang.OutOfMemoryError: Direct buffer memory" if MaxDirectMemorySize is reached.
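For reference, below is a minimal standalone Java sketch (not Flink code; the class name and sizes are illustrative) of the behavior described above: when live direct buffers exhaust -XX:MaxDirectMemorySize, allocation fails with "java.lang.OutOfMemoryError: Direct buffer memory". Note that the JDK does first attempt a System.gc() to reclaim unreferenced direct buffers before giving up; the error is only thrown when, as here, the buffers are still referenced.

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    // Run with: java -XX:MaxDirectMemorySize=64m DirectMemoryDemo
    public class DirectMemoryDemo {
        public static void main(String[] args) {
            // Keep every buffer referenced so the JDK's cleanup cannot reclaim it.
            List<ByteBuffer> buffers = new ArrayList<>();
            while (true) {
                // 8 MB per allocation; after roughly 64 MB the reservation fails
                // with java.lang.OutOfMemoryError: Direct buffer memory.
                buffers.add(ByteBuffer.allocateDirect(8 * 1024 * 1024));
                System.out.println("Allocated " + (buffers.size() * 8) + " MB of direct memory");
            }
        }
    }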
>>
>> Thank you again for your help!
>>
>> Best,
>> Ethan
>>
>>> On Dec 16, 2019, at 9:44 PM, Xintong Song <tonysong...@gmail.com> wrote:
>>>
>>> Hi Ethan,
>>>
>>> When you say "it's doing better than before", what was your setup before? Was it on-heap managed memory? With preallocation enabled or disabled? Also, what deployment (standalone, yarn, or local executor) do you run Flink on? It's hard to tell why the performance became better without knowing the information above.
>>>
>>> Since you are using RocksDB and configure managed memory to off-heap, you should set pre-allocation to false. Streaming jobs with the RocksDB state backend do not use managed memory at all. Setting managed memory to off-heap only makes Flink launch the JVM with smaller heap space, leaving more space outside the JVM. Setting pre-allocation to false makes Flink allocate managed memory on demand, and since there is no demand, the managed memory will not be allocated. Therefore, the memory space left outside the JVM can be fully leveraged by RocksDB.
>>>
>>> Regarding related source code, I would recommend the following:
>>> - MemoryManager - for how managed memory is allocated / used. Related to pre-allocation.
>>> - ContaineredTaskManagerParameters - for how the JVM memory parameters are decided. Related to on-heap / off-heap managed memory.
>>> - TaskManagerServices#fromConfiguration - for how the different components are created, as well as how their memory sizes are decided. Also related to on-heap / off-heap managed memory.
>>>
>>> Thank you~
>>> Xintong Song
>>>
>>> On Tue, Dec 17, 2019 at 11:00 AM Ethan Li <ethanopensou...@gmail.com> wrote:
>>> Thank you Xintong, Vino, for taking the time to answer my question. I didn’t know managed memory is only for batch jobs.
>>>
>>> I tried setting Flink managed memory to off-heap (with preallocation set to true) and it’s doing better than before. That would not make sense if managed memory is not used. I was confused. Then I found this doc: https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>>>
>>> "Configuring an off-heap state backend like RocksDB means either also setting managed memory to off-heap or adjusting the cutoff ratio, to dedicate less memory to the JVM heap."
>>>
>>> We use RocksDB too, so I guess I was doing that correctly by accident. So the question here is: in this case, should we set preallocate to true or false?
>>>
>>> If set to true, the TM will allocate memory off-heap during startup. Will this part of memory be used by RocksDB?
>>> If set to false, how is this off-heap memory being managed? Will the allocated memory ever be cleaned up and reused?
>>>
>>> I’d really appreciate it if you or anyone from the community could share some ideas or point me to the code. I am reading the source code but haven’t gotten there yet.
>>>
>>> Thank you very much!
>>>
>>> Best,
>>> Ethan
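To make the recommended pre-1.10 setup concrete, here is a flink-conf.yaml sketch of the configuration discussed above (RocksDB state backend, managed memory off-heap, preallocation disabled). The sizes are illustrative, loosely based on the 150GB / 150GB split mentioned earlier; the keys come from the Flink 1.8 configuration page linked above.

    # flink-conf.yaml (Flink 1.8.x, legacy memory model) -- sizes illustrative
    taskmanager.heap.size: 300g            # total TaskManager memory
    taskmanager.memory.size: 150g          # managed memory, taken off-heap below
    taskmanager.memory.off-heap: true      # JVM starts with a correspondingly smaller heap
    taskmanager.memory.preallocate: false  # segments allocated on demand only; with
                                           # RocksDB there is no demand, so this memory
                                           # stays available to RocksDB outside the JVM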
>>>> On Dec 16, 2019, at 1:27 AM, Xintong Song <tonysong...@gmail.com> wrote:
>>>>
>>>> Hi Ethan,
>>>>
>>>> Currently, managed memory is only used for batch jobs (DataSet / Blink SQL). Setting it to off-heap and enabling pre-allocation can improve performance when using managed memory. However, since you are running streaming jobs, which "currently do not use the managed memory", I would suggest you set managed memory to on-heap and disable pre-allocation. In this way, Flink will not allocate any managed memory segments that are not actually used, and the corresponding memory can still be used for other JVM heap purposes.
>>>>
>>>> The above is for Flink 1.9 and earlier. In the upcoming Flink 1.10, we are removing the pre-allocation of managed memory, making managed memory always off-heap, and making the RocksDB state backend use managed memory. This means the two config options you mentioned will no longer exist in future releases. In case you're planning to migrate to the upcoming Flink 1.10: if your streaming jobs use the RocksDB state backend, then hopefully it's not necessary for you to change any configuration; but if your jobs use the heap state backend, it would be better to configure the managed memory size / fraction to 0, because otherwise the corresponding memory cannot be used by any component.
>>>>
>>>> Thank you~
>>>> Xintong Song
>>>>
>>>> On Sat, Dec 14, 2019 at 5:20 AM Ethan Li <ethanopensou...@gmail.com> wrote:
>>>> Hi Community,
>>>>
>>>> I have a question about the taskmanager.memory.preallocate config in the doc: https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#taskmanager-memory-preallocate
>>>>
>>>> We have a large-memory box, so as suggested there, we should use off-heap memory for Flink managed memory. The doc then suggests setting taskmanager.memory.preallocate to true. However:
>>>>
>>>> "For streaming setups is is highly recommended to set this value to false as the core state backends currently do not use the managed memory."
>>>>
>>>> Our Flink setup is mainly for streaming jobs, so I think the above applies to our case. Should I use off-heap with "preallocate" set to false? What would be the impact of these configs?
>>>>
>>>> Thank you very much!
>>>>
>>>> Best,
>>>> Ethan
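Regarding Xintong's Flink 1.10 note above: for jobs on the heap state backend in 1.10+, where managed memory is always off-heap, that memory can be handed back to the heap by setting its size (or fraction) to zero. A sketch, with keys taken from the Flink 1.10 configuration options, assuming a cluster that runs heap-state-backend jobs only:

    # flink-conf.yaml (Flink 1.10+), heap state backend only
    taskmanager.memory.managed.size: 0
    # or equivalently:
    # taskmanager.memory.managed.fraction: 0.0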