Then I would suggest the following.
- Check the task manager log to see if the '-D' properties are properly
loaded. They should be located at the beginning of the log file.
- You can also try to log into the pod and check the JVM launch command
with "ps -ef | grep TaskManagerRunner". I suspect there might be some
argument passing problem regarding the spaces and double quotation marks.
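
As a rough sketch of the second check, something like the following could
confirm whether a JVM option reaches the process as its own argument.
Note that has_arg is a hypothetical helper and the pod name is a
placeholder; neither comes from Flink or Kubernetes.

```shell
#!/bin/sh
# Sketch only: has_arg is a hypothetical helper, not part of Flink or kubectl.
# It succeeds when FLAG appears as a separate, whitespace-delimited argument
# in CMDLINE. It fails when the flag is absent or has stray characters
# (such as a literal double quote) attached to it.
has_arg() {
  cmdline=$1
  flag=$2
  # Turn tabs/newlines into spaces and pad both ends so that the first and
  # last arguments can match as well.
  case " $(printf '%s' "$cmdline" | tr '\t\n' '  ') " in
    *" $flag "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use against a running pod (the pod name is a placeholder):
#   cmd=$(kubectl exec my-taskmanager-pod -- ps -ef | grep '[T]askManagerRunner')
#   has_arg "$cmd" "-XX:+HeapDumpOnOutOfMemoryError" || echo "flag missing or malformed"
```

If the check fails while the option is visible in the raw "ps -ef" output,
that would point to a quoting problem rather than a missing option.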

Thank you~

Xintong Song

On Thu, Apr 30, 2020 at 11:39 AM Eleanore Jin <>

> Hi Xintong,
> Thanks for the detailed explanation!
> As for the 2nd question: I mount it to an emptyDir. I assume a pod restart
> will not cause the pod to be rescheduled to another node, so the directory
> should stay? I verified this by adding the following directly to
> flink-conf.yaml, in which case the heap dump is taken and stays in the
> directory:
> -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps
> In addition, I also don't see the log print out something like: Heap dump
> file created [5220997112 bytes in 73.464 secs], which I see when directly
> adding the options in the flink-conf.yaml
> containers:
>   - volumeMounts:
>       - mountPath: /dumps
>         name: heap-dumps
> volumes:
>   - emptyDir: {}
>     name: heap-dumps
> Thanks a lot!
> Eleanore
> On Wed, Apr 29, 2020 at 7:55 PM Xintong Song <>
> wrote:
>> Hi Eleanore,
>> I'd like to explain 1 & 2. For 3, I have no idea either.
>> 1. I don't see the heap size from UI for task manager show correctly
>> Despite the 'heap' in the key, 'taskmanager.heap.size' accounts for the
>> total memory of a Flink task manager, rather than only the heap memory. A
>> Flink task manager process consumes not only java heap memory, but also
>> direct memory (e.g., network buffers) and native memory (e.g., JVM
>> overhead). That's why the JVM heap size shown on the UI is much smaller
>> than the configured 'taskmanager.heap.size'. Please refer to this document
>> [1] for more details. This document comes from Flink 1.9 and has not been
>> back-ported to 1.8, but the contents should apply to 1.8 as well.
>> 2. I don't see the heap dump file in the restarted pod /dumps/oom.bin, did
>> I set the java opts wrong?
>> The java options look good to me. Is the configured path '/dumps/oom.bin'
>> a local path of the pod or a path of the host mounted onto the pod? The
>> restarted pod is a completely new, different pod. Everything you write to
>> the old pod goes away when the pod terminates, unless it is written to the
>> host through mounted storage.
>> Thank you~
>> Xintong Song
>> [1]
>> On Thu, Apr 30, 2020 at 7:41 AM Eleanore Jin <>
>> wrote:
>>> Hi All,
>>> Currently I am running a Flink job cluster (v1.8.2) on Kubernetes with 4
>>> pods, each pod with a parallelism of 4.
>>> The Flink job reads from a source topic with 96 partitions and applies a
>>> per-element filter. The filter value comes from a broadcast topic, and the
>>> job always uses the latest message as the filter criterion. It then
>>> publishes to a sink topic.
>>> There is no checkpointing and no state involved.
>>> I am seeing "GC overhead limit exceeded" errors continuously, and the
>>> pods keep restarting.
>>> So I tried to increase the heap size for the task manager with:
>>> containers:
>>>       - args:
>>>         - task-manager
>>>         - -Djobmanager.rpc.address=service-job-manager
>>>         - -Dtaskmanager.heap.size=4096m
>>>         -"-XX:+HeapDumpOnOutOfMemoryError
>>> -XX:HeapDumpPath=/dumps/oom.bin"
>>> 3 things I noticed,
>>> 1. I don't see the heap size from UI for task manager show correctly
>>> [image: image.png]
>>> 2. I don't see the heap dump file in the restarted pod /dumps/oom.bin,
>>> did I set the java opts wrong?
>>> 3. I continuously see the logs below from all pods; not sure if this
>>> causes any issue
>>> {"@timestamp":"2020-04-29T23:39:43.387Z","@version":"1","message":"[Consumer
>>> clientId=consumer-1, groupId=aba774bc] Node 6 was unable to process the
>>> fetch request with (sessionId=2054451921, epoch=474):
>>> FETCH_SESSION_ID_NOT_FOUND.","logger_name":"org.apache.kafka.clients.FetchSessionHandler","thread_name":"pool-6-thread-1","level":"INFO","level_value":20000}
>>> Thanks a lot for any help!
>>> Best,
>>> Eleanore
