Hi Ori,

I'm not sure where the problem comes from, but there are several things
that might be worth a try.
- Further increasing the `jvm-overhead`. Your `ps` result suggests that
the Flink process uses 120+GB, while `process.size` is configured to 112GB,
so I think a 2GB `jvm-overhead` might not be enough. I would suggest tuning
`managed.fraction` back to 0.4 and increasing `jvm-overhead` to around 12GB
(see the config sketch below the list). This should give you roughly the
same `process.size` as before, while leaving more room for unmanaged native
memory.
- During the 7-10 days the job was running, were there any
failovers/restarts? If so, you might want to look into this comment [1] in
FLINK-18712.
- If neither of the above helps, we might need to leverage tools (e.g.,
JVM Native Memory Tracking, NMT [2]) to track the native memory usage and
see exactly where the leak comes from (a rough outline is sketched below
the list).
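
Regarding the first point, a minimal sketch of what that could look like in
flink-conf.yaml, assuming you keep the total process size at 112GB (the
exact values of course depend on your setup):

  taskmanager.memory.process.size: 112g
  taskmanager.memory.managed.fraction: 0.4
  # setting min and max to the same value pins the overhead size
  taskmanager.memory.jvm-overhead.min: 12g
  taskmanager.memory.jvm-overhead.max: 12g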
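
Regarding NMT, a rough sketch of how to use it here (enabling NMT requires
a JVM restart and adds some overhead):

  # in flink-conf.yaml, pass the flag to the TaskManager JVMs
  env.java.opts.taskmanager: -XX:NativeMemoryTracking=summary

  # then, on the TaskManager machine (<pid> = the TaskManager JVM's pid)
  jcmd <pid> VM.native_memory baseline
  jcmd <pid> VM.native_memory summary.diff

The baseline/diff pair shows which native memory categories keep growing
over time.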

Thank you~

Xintong Song


[1]
https://issues.apache.org/jira/browse/FLINK-18712?focusedCommentId=17189138&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17189138

[2]
https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html

On Thu, Oct 29, 2020 at 7:51 PM Ori Popowski <ori....@gmail.com> wrote:

>
> Hi Xintong,
>
> Unfortunately I cannot upgrade to 1.10.2, because EMR has either 1.10.0 or
> 1.11.0.
>
> About the overhead - turns out I already configured
> taskmanager.memory.jvm-overhead.max to 2 gb instead of the default 1 gb.
> Should I increase it further?
>
> state.backend.rocksdb.memory.managed is already not explicitly configured.
>
> Is there anything else I can do?
>
>
>
> On Thu, Oct 29, 2020 at 1:24 PM Xintong Song <tonysong...@gmail.com>
> wrote:
>
>> Hi Ori,
>>
>> RocksDB also uses managed memory. If the memory overuse indeed comes from
>> RocksDB, then increasing the managed memory fraction will not help: RocksDB
>> will try to use as much memory as the configured managed memory size, so
>> increasing the managed memory fraction also makes RocksDB try to use more
>> memory. That is why I suggested increasing `jvm-overhead` instead.
>>
>> Please also make sure the configuration option
>> `state.backend.rocksdb.memory.managed` is either not explicitly configured,
>> or configured to `true`.
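>>
>> For reference, if that option appears in your flink-conf.yaml at all, it
>> should simply read:
>>
>>   state.backend.rocksdb.memory.managed: true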
>>
>> In addition, I noticed that you are using Flink 1.10.0. You might want to
>> upgrade to 1.10.2, to include the latest bug fixes on the 1.10 release.
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Thu, Oct 29, 2020 at 4:41 PM Ori Popowski <ori....@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> PID 20331 is indeed the Flink process, specifically the TaskManager
>>> process.
>>>
>>> - Workload is a streaming workload reading from Kafka and writing to S3
>>> using a custom Sink
>>> - The RocksDB state backend is used with default settings
>>> - My external dependencies are:
>>> -- logback
>>> -- jackson
>>> -- flatbuffers
>>> -- jaxb-api
>>> -- scala-java8-compat
>>> -- apache commons-io
>>> -- apache commons-compress
>>> -- software.amazon.awssdk s3
>>> - What do you mean by UDFs? I've implemented several operators like
>>> KafkaDeserializationSchema, FlatMap, Map, ProcessFunction.
>>>
>>> We use a SessionWindow with a 30-minute gap, and a watermark with a
>>> 10-minute delay.
>>>
>>> We did confirm that we have some keys in our job which keep receiving
>>> records indefinitely, but I'm not sure why that would cause a managed
>>> memory leak, since this state should be flushed to RocksDB, freeing the
>>> memory used. We have a guard against this, where we track the overall size
>>> of the records for each key, and when it reaches 300mb we stop moving the
>>> records downstream, which causes them to form a session and go through the
>>> sink.
>>>
>>> About what you suggested - I kind of did this by increasing the managed
>>> memory fraction to 0.5, and it did postpone the problem (the TMs started
>>> crashing after 10 days instead of 7). It looks like anything I do on that
>>> front will only postpone the problem, not solve it.
>>>
>>> I am attaching the full job configuration.
>>>
>>>
>>>
>>> On Thu, Oct 29, 2020 at 10:09 AM Xintong Song <tonysong...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ori,
>>>>
>>>> It looks like Flink indeed uses more memory than expected. I assume the
>>>> first item with PID 20331 is the Flink process, right?
>>>>
>>>> It would be helpful if you could briefly describe your workload.
>>>> - What kind of workload are you running? Streaming or batch?
>>>> - Do you use RocksDB state backend?
>>>> - Any UDFs or 3rd party dependencies that might allocate significant
>>>> native memory?
>>>>
>>>> Moreover, if the metrics show only 20% heap usage, I would suggest
>>>> configuring a smaller `task.heap.size`, leaving more memory off-heap. The
>>>> reduced heap size does not necessarily all have to go to managed memory.
>>>> You can also try increasing the `jvm-overhead`, simply to leave more
>>>> native memory in the container in case there are other significant native
>>>> memory users.
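>>>>
>>>> For illustration only (the exact sizes depend on your setup), such a
>>>> change in flink-conf.yaml might look like:
>>>>
>>>>   taskmanager.memory.task.heap.size: 40g
>>>>   taskmanager.memory.jvm-overhead.min: 4g
>>>>   taskmanager.memory.jvm-overhead.max: 4g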
>>>>
>>>> Thank you~
>>>>
>>>> Xintong Song
>>>>
>>>>
>>>>
>>>> On Wed, Oct 28, 2020 at 5:53 PM Ori Popowski <ori....@gmail.com> wrote:
>>>>
>>>>> Hi Xintong,
>>>>>
>>>>> See here:
>>>>>
>>>>> # Top memory users
>>>>> ps auxwww --sort -rss | head -10
>>>>> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME
>>>>> COMMAND
>>>>> yarn     20339 35.8 97.0 128600192 126672256 ? Sl   Oct15 5975:47
>>>>> /etc/alternatives/jre/bin/java -Xmx54760833024 -Xms54760833024 -XX:Max
>>>>> root      5245  0.1  0.4 5580484 627436 ?      Sl   Jul30 144:39
>>>>> /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X
>>>>> hadoop    5252  0.1  0.4 7376768 604772 ?      Sl   Jul30 153:22
>>>>> /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X
>>>>> yarn     26857  0.3  0.2 4214784 341464 ?      Sl   Sep17 198:43
>>>>> /etc/alternatives/jre/bin/java -Dproc_nodemanager -Xmx2048m -XX:OnOutOf
>>>>> root      5519  0.0  0.2 5658624 269344 ?      Sl   Jul30  45:21
>>>>> /usr/bin/java -Xmx1500m -Xms300m -XX:+ExitOnOutOfMemoryError -XX:MinHea
>>>>> root      1781  0.0  0.0 172644  8096 ?        Ss   Jul30   2:06
>>>>> /usr/lib/systemd/systemd-journald
>>>>> root      4801  0.0  0.0 2690260 4776 ?        Ssl  Jul30   4:42
>>>>> /usr/bin/amazon-ssm-agent
>>>>> root      6566  0.0  0.0 164672  4116 ?        R    00:30   0:00 ps
>>>>> auxwww --sort -rss
>>>>> root      6532  0.0  0.0 183124  3592 ?        S    00:30   0:00
>>>>> /usr/sbin/CROND -n
>>>>>
>>>>> On Wed, Oct 28, 2020 at 11:34 AM Xintong Song <tonysong...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ori,
>>>>>>
>>>>>> The error message suggests that there's not enough physical memory on
>>>>>> the machine to satisfy the allocation. This does not necessarily mean a
>>>>>> managed memory leak; a managed memory leak is only one of the
>>>>>> possibilities. There are other potential causes, e.g., another
>>>>>> process/container on the machine using more memory than expected, or the
>>>>>> Yarn NM not being configured with enough memory reserved for the system
>>>>>> processes.
>>>>>>
>>>>>> I would suggest first looking into the machine's memory usage, to see
>>>>>> whether the Flink process indeed uses more memory than expected. This
>>>>>> could be achieved via any of the following (a few example commands are
>>>>>> sketched below the list):
>>>>>> - Run the `top` command
>>>>>> - Look into the `/proc/meminfo` file
>>>>>> - Any container memory usage metrics that are available to your Yarn
>>>>>> cluster
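>>>>>>
>>>>>> For example (a rough sketch; the exact options may vary slightly by
>>>>>> distribution):
>>>>>>
>>>>>>   top                                # press Shift+M to sort by memory
>>>>>>   cat /proc/meminfo                  # machine-wide memory accounting
>>>>>>   ps auxwww --sort -rss | head -10   # processes with the largest RSS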
>>>>>>
>>>>>> Thank you~
>>>>>>
>>>>>> Xintong Song
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski <ori....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> After the job is running for 10 days in production, TaskManagers
>>>>>>> start failing with:
>>>>>>>
>>>>>>> Connection unexpectedly closed by remote task manager
>>>>>>>
>>>>>>> Looking in the machine logs, I can see the following error:
>>>>>>>
>>>>>>> ============= Java processes for user hadoop =============
>>>>>>> OpenJDK 64-Bit Server VM warning: INFO:
>>>>>>> os::commit_memory(0x00007fb4f4010000, 1006567424, 0) failed; 
>>>>>>> error='Cannot
>>>>>>> allocate memory' (err
>>>>>>> #
>>>>>>> # There is insufficient memory for the Java Runtime Environment to
>>>>>>> continue.
>>>>>>> # Native memory allocation (mmap) failed to map 1006567424 bytes for
>>>>>>> committing reserved memory.
>>>>>>> # An error report file with more information is saved as:
>>>>>>> # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log
>>>>>>> =========== End java processes for user hadoop ===========
>>>>>>>
>>>>>>> In addition, the metrics for the TaskManager show very low Heap
>>>>>>> memory consumption (20% of Xmx).
>>>>>>>
>>>>>>> Hence, I suspect there is a memory leak in the TaskManager's Managed
>>>>>>> Memory.
>>>>>>>
>>>>>>> This is my TaskManager's memory breakdown:
>>>>>>> flink process 112g
>>>>>>> framework.heap.size 0.2g
>>>>>>> task.heap.size 50g
>>>>>>> managed.size 54g
>>>>>>> framework.off-heap.size 0.5g
>>>>>>> task.off-heap.size 1g
>>>>>>> network 2g
>>>>>>> XX:MaxMetaspaceSize 1g
>>>>>>>
>>>>>>> As you can see, the managed memory is 54g, so it's already high (my
>>>>>>> managed.fraction is set to 0.5).
>>>>>>>
>>>>>>> I'm running Flink 1.10. Full job details attached.
>>>>>>>
>>>>>>> Can someone advise what would cause a managed memory leak?
>>>>>>>
>>>>>>>
>>>>>>>
