Then I would suggest the following.

- Check the task manager log to see whether the '-D' properties are properly loaded. They should appear at the beginning of the log file.
- You can also log into the pod and check the JVM launch command with "ps -ef | grep TaskManagerRunner". I suspect there might be an argument-passing problem caused by the spaces and double quotation marks.
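For example, something like the following should show the actual launch command (a sketch only; the pod name is a placeholder, and this assumes you have kubectl access to the namespace):

  kubectl exec <taskmanager-pod-name> -- ps -ef | grep TaskManagerRunner

If the options were picked up correctly, '-XX:+HeapDumpOnOutOfMemoryError' and '-XX:HeapDumpPath=/dumps/oom.bin' should appear as separate JVM arguments in that command line. If they instead show up glued into one token, or with literal quotation marks around them, the quoting of the container args is likely the problem. In that case, one thing you could try (just a sketch, I have not verified it) is quoting the whole list entry at the YAML level rather than embedding quotes inside the value:

  - args:
    - task-manager
    - -Djobmanager.rpc.address=service-job-manager
    - -Dtaskmanager.heap.size=4096m
    - "-Denv.java.opts.taskmanager=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin"

so that no literal double quotation marks end up in the argument that Flink parses.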
Thank you~

Xintong Song

On Thu, Apr 30, 2020 at 11:39 AM Eleanore Jin <eleanore....@gmail.com> wrote:

> Hi Xintong,
>
> Thanks for the detailed explanation!
>
> As for the 2nd question: I mount it to an emptyDir. I assume a pod restart will not cause the pod to be rescheduled to another node, so the dump should stay? I verified this by adding the options directly to flink-conf.yaml, and there I see the heap dump is taken and stays in the directory: env.java.opts: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps
>
> In addition, I also don't see the log print out something like "Heap dump file created [5220997112 bytes in 73.464 secs]", which I do see when adding the options directly in flink-conf.yaml.
>
> containers:
>   - volumeMounts:
>       - mountPath: /dumps
>         name: heap-dumps
> volumes:
>   - emptyDir: {}
>     name: heap-dumps
>
> Thanks a lot!
>
> Eleanore
>
> On Wed, Apr 29, 2020 at 7:55 PM Xintong Song <tonysong...@gmail.com> wrote:
>
>> Hi Eleanore,
>>
>> I'd like to explain about 1 & 2. For 3, I have no idea either.
>>
>>> 1. I don't see the heap size from the UI for the task manager shown correctly
>>
>> Despite the 'heap' in the key, 'taskmanager.heap.size' accounts for the total memory of a Flink task manager, rather than only the heap memory. A Flink task manager process consumes not only Java heap memory, but also direct memory (e.g., network buffers) and native memory (e.g., JVM overhead). That's why the JVM heap size shown on the UI is much smaller than the configured 'taskmanager.heap.size'. Please refer to this document [1] for more details. The document comes from Flink 1.9 and has not been back-ported to 1.8, but the contents should apply to 1.8 as well.
>>
>>> 2. I don't see the heap dump file in the restarted pod at /dumps/oom.bin, did I set the java opts wrong?
>>
>> The java options look good to me. Is the configured path '/dumps/oom.bin' a local path inside the pod, or a path on the host mounted onto the pod? The restarted pod is a completely new pod. Everything you write to the old pod goes away when the pod terminates, unless it is written to the host through mounted storage.
>>
>> Thank you~
>>
>> Xintong Song
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/mem_setup.html
>>
>> On Thu, Apr 30, 2020 at 7:41 AM Eleanore Jin <eleanore....@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Currently I am running a Flink job cluster (v1.8.2) on Kubernetes with 4 pods, each pod with a parallelism of 4.
>>>
>>> The Flink job reads from a source topic with 96 partitions and does a per-element filter; the filter criteria always come from the latest message on a broadcast topic, and the results are published to a sink topic.
>>>
>>> There is no checkpointing and no state involved.
>>>
>>> Then I started seeing "GC overhead limit exceeded" errors continuously, and the pods keep restarting.
>>>
>>> So I tried to increase the heap size for the task manager by:
>>>
>>> containers:
>>>   - args:
>>>       - task-manager
>>>       - -Djobmanager.rpc.address=service-job-manager
>>>       - -Dtaskmanager.heap.size=4096m
>>>       - -Denv.java.opts.taskmanager="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin"
>>>
>>> 3 things I noticed:
>>>
>>> 1. I don't see the heap size from the UI for the task manager shown correctly
>>>
>>> [image: image.png]
>>>
>>> 2. I don't see the heap dump file in the restarted pod at /dumps/oom.bin, did I set the java opts wrong?
>>>
>>> 3. I continuously see the log below from all pods; I am not sure whether it causes any issue:
>>>
>>> {"@timestamp":"2020-04-29T23:39:43.387Z","@version":"1","message":"[Consumer clientId=consumer-1, groupId=aba774bc] Node 6 was unable to process the fetch request with (sessionId=2054451921, epoch=474): FETCH_SESSION_ID_NOT_FOUND.","logger_name":"org.apache.kafka.clients.FetchSessionHandler","thread_name":"pool-6-thread-1","level":"INFO","level_value":20000}
>>>
>>> Thanks a lot for any help!
>>>
>>> Best,
>>> Eleanore
>>