[
https://issues.apache.org/jira/browse/FLINK-39924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Khachatryan reassigned FLINK-39924:
-----------------------------------------
Assignee: Keith Lee
> Memory fragmentation from jemalloc misconfiguration
> ---------------------------------------------------
>
> Key: FLINK-39924
> URL: https://issues.apache.org/jira/browse/FLINK-39924
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Configuration
> Affects Versions: 2.0.2, 2.2.1, 1.20.5, 2.1.3
> Reporter: Keith Lee
> Assignee: Keith Lee
> Priority: Critical
> Labels: pull-request-available
>
> We observed excessive memory fragmentation in production. Using malloc_stats,
> we identified the most extreme case of fragmentation at 3.91 GB (10.01 GB
> Resident - 6.1 GB Active), which is significant because the pod has a limit of
> 16 GB.
> We also observed that the jemalloc arena count was higher than the expected
> default of 4 x number_of_cpu_cores.
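As a rough illustration of the arithmetic in the description, the fragmentation figure is the gap between the resident and active memory reported by malloc_stats. The sketch below recomputes it and relates it to the pod limit. The function names are hypothetical and are not part of Flink or jemalloc.

```python
def fragmentation_gb(resident_gb: float, active_gb: float) -> float:
    """Memory held by the allocator but not actively in use (resident - active)."""
    return resident_gb - active_gb


def share_of_limit(fragmentation: float, limit_gb: float) -> float:
    """Fraction of the pod memory limit consumed by fragmentation."""
    return fragmentation / limit_gb


# Figures taken from the issue description.
frag = fragmentation_gb(resident_gb=10.01, active_gb=6.1)
print(f"fragmentation: {frag:.2f} GB")
print(f"share of pod limit: {share_of_limit(frag, 16.0):.1%}")
```

With the reported numbers, the fragmented memory comes to roughly a quarter of the 16 GB pod limit.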
> h2. Why is high jemalloc arena count bad?
> A large number of arenas leads to infrequently used arenas, and infrequently
> used arenas hold dirty pages for dirty_decay_ms before releasing memory to the
> OS. This leaves less memory for the Flink process and the OS page cache,
> impacting performance and increasing the likelihood of an OOMKill.
> h2. Root cause
> By default, jemalloc sets narenas to 4 * number_of_cpu_cores. However,
> *jemalloc is not container aware and the value for number_of_cpu_cores
> is obtained from the host machine* instead of the pod CPU resource configuration.
> See jemalloc default:
> [https://github.com/jemalloc/jemalloc/blob/4de3a4c3d1bb4520acdc856ddab3e57a28eb7795/src/jemalloc_init.c#L379-L391]
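To make the mismatch concrete, here is a minimal sketch that compares the arena count derived from the host core count with one derived from a pod CPU limit, and builds a jemalloc option string of the form narenas:N (the form jemalloc accepts through MALLOC_CONF). The function names, the rounding up of fractional CPU limits, and the example limits are assumptions for illustration. This is not necessarily how the linked pull request implements the fix.

```python
import math


def default_narenas(ncpus: int) -> int:
    """Arena count per the rule stated in the issue: 4 * CPU cores."""
    return 4 * ncpus


def container_narenas(cpu_limit: float) -> int:
    """Arena count based on the pod CPU limit instead of the host.

    Assumption: fractional limits (e.g. 2.5 CPUs) are rounded up,
    with a floor of one CPU.
    """
    return 4 * max(1, math.ceil(cpu_limit))


def malloc_conf(cpu_limit: float) -> str:
    """Build a jemalloc option string suitable for MALLOC_CONF."""
    return f"narenas:{container_narenas(cpu_limit)}"


host_cores = 14   # the 14-core reproduction machine from the issue
pod_cpus = 1.0    # hypothetical pod CPU limit

print("host-derived arenas:", default_narenas(host_cores))
print("pod-derived arenas: ", container_narenas(pod_cpus))
print("MALLOC_CONF value:  ", malloc_conf(pod_cpus))
```

On a 14-core host the host-derived count is 56 arenas, regardless of how few CPUs the pod is actually allowed to use.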
> h2. Reproduction and confirmation
> Steps to reproduce can be found here:
> [https://github.com/leekeiabstraction/flink-docker/tree/reproduce-jemalloc-fragmentation/reproduce-jemalloc-fragmentation]
> The reproduction was run on a 14-core MacBook Pro. We found a reduction of
> 10.7% in resident set size and a slight performance improvement when narenas
> is configured to 4 * pod CPU count.
> {{============================================================}}
> {{[+] Per-image summary:}}
> {{============================================================}}
> {{ image                          highest anon  avg anon    lowest write-recs  avg write-recs}}
> {{ flink:2.2.1-scala_2.12-java17  1679.3 MiB    1522.6 MiB  186901             207614}}
> {{ flink-2.2.1-narenas4           1499.7 MiB    1301.9 MiB  200945             213198}}
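The relative changes can be recomputed from the per-image summary. The sketch below is only a sanity check over the numbers quoted in the table; the helper name is hypothetical. The 10.7% figure in the description appears to correspond to the "highest anon" column.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Signed change of candidate relative to baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Values copied from the per-image summary (MiB and record counts).
baseline = {"highest_anon": 1679.3, "avg_anon": 1522.6,
            "lowest_write_recs": 186901, "avg_write_recs": 207614}
narenas4 = {"highest_anon": 1499.7, "avg_anon": 1301.9,
            "lowest_write_recs": 200945, "avg_write_recs": 213198}

for metric, base_value in baseline.items():
    change = percent_change(base_value, narenas4[metric])
    print(f"{metric:>18}: {change:+.1f}%")
```

Negative values for the anon columns indicate lower memory use, and positive values for the write-recs columns indicate higher throughput.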
--
This message was sent by Atlassian Jira
(v8.20.10#820010)