Then I would suggest the following.

- Check the task manager log to see whether the '-D' properties are properly loaded. They should appear at the beginning of the log file.
- You can also log into the pod and check the JVM launch command with "ps -ef | grep TaskManagerRunner". I suspect there might be an argument-passing problem caused by the spaces and double quotation marks.
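For example, something like the following should show the actual launch command (a sketch only; the pod name is a placeholder, and this assumes you have kubectl access to the namespace):

  kubectl exec <taskmanager-pod-name> -- ps -ef | grep TaskManagerRunner

If the options were picked up correctly, '-XX:+HeapDumpOnOutOfMemoryError' and '-XX:HeapDumpPath=/dumps/oom.bin' should appear as separate JVM arguments in that command line. If they instead show up glued into one token, or with literal quotation marks around them, the quoting of the container args is likely the problem. In that case, one thing you could try (just a sketch, I have not verified it) is quoting the whole list entry at the YAML level rather than embedding quotes inside the value:

  - args:
    - task-manager
    - -Djobmanager.rpc.address=service-job-manager
    - -Dtaskmanager.heap.size=4096m
    - "-Denv.java.opts.taskmanager=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin"

so that no literal double quotation marks end up in the argument that Flink parses.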
Thank you~

Xintong Song

On Thu, Apr 30, 2020 at 11:39 AM Eleanore Jin <eleanore....@gmail.com> wrote:

> Hi Xintong,
>
> Thanks for the detailed explanation!
>
> As for the 2nd question: I mount it to an emptyDir. I assume a pod restart will not cause the pod to be rescheduled to another node, so the dump should stay? I verified this by adding the options directly to flink-conf.yaml, and there I see the heap dump is taken and stays in the directory: env.java.opts: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps
>
> In addition, I also don't see the log print out something like "Heap dump file created [5220997112 bytes in 73.464 secs]", which I do see when adding the options directly in flink-conf.yaml.
>
> containers:
>   - volumeMounts:
>       - mountPath: /dumps
>         name: heap-dumps
> volumes:
>   - emptyDir: {}
>     name: heap-dumps
>
> Thanks a lot!
>
> Eleanore
>
> On Wed, Apr 29, 2020 at 7:55 PM Xintong Song <tonysong...@gmail.com> wrote:
>
>> Hi Eleanore,
>>
>> I'd like to explain about 1 & 2. For 3, I have no idea either.
>>
>>> 1. I don't see the heap size from the UI for the task manager shown correctly
>>
>> Despite the 'heap' in the key, 'taskmanager.heap.size' accounts for the total memory of a Flink task manager, rather than only the heap memory. A Flink task manager process consumes not only Java heap memory, but also direct memory (e.g., network buffers) and native memory (e.g., JVM overhead). That's why the JVM heap size shown on the UI is much smaller than the configured 'taskmanager.heap.size'. Please refer to this document [1] for more details. The document comes from Flink 1.9 and has not been back-ported to 1.8, but the contents should apply to 1.8 as well.
>>
>>> 2. I don't see the heap dump file in the restarted pod at /dumps/oom.bin, did I set the java opts wrong?
>>
>> The java options look good to me. Is the configured path '/dumps/oom.bin' a local path inside the pod, or a path on the host mounted onto the pod? The restarted pod is a completely new pod. Everything you write to the old pod goes away when the pod terminates, unless it is written to the host through mounted storage.
>>
>> Thank you~
>>
>> Xintong Song
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/mem_setup.html
>>
>> On Thu, Apr 30, 2020 at 7:41 AM Eleanore Jin <eleanore....@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Currently I am running a Flink job cluster (v1.8.2) on Kubernetes with 4 pods, each pod with a parallelism of 4.
>>>
>>> The Flink job reads from a source topic with 96 partitions and does a per-element filter; the filter criteria always come from the latest message on a broadcast topic, and the results are published to a sink topic.
>>>
>>> There is no checkpointing and no state involved.
>>>
>>> Then I started seeing "GC overhead limit exceeded" errors continuously, and the pods keep restarting.
>>>
>>> So I tried to increase the heap size for the task manager by:
>>>
>>> containers:
>>>   - args:
>>>       - task-manager
>>>       - -Djobmanager.rpc.address=service-job-manager
>>>       - -Dtaskmanager.heap.size=4096m
>>>       - -Denv.java.opts.taskmanager="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin"
>>>
>>> 3 things I noticed:
>>>
>>> 1. I don't see the heap size from the UI for the task manager shown correctly
>>>
>>> [image: image.png]
>>>
>>> 2. I don't see the heap dump file in the restarted pod at /dumps/oom.bin, did I set the java opts wrong?
>>>
>>> 3. I continuously see the log below from all pods; I am not sure whether it causes any issue:
>>>
>>> {"@timestamp":"2020-04-29T23:39:43.387Z","@version":"1","message":"[Consumer clientId=consumer-1, groupId=aba774bc] Node 6 was unable to process the fetch request with (sessionId=2054451921, epoch=474): FETCH_SESSION_ID_NOT_FOUND.","logger_name":"org.apache.kafka.clients.FetchSessionHandler","thread_name":"pool-6-thread-1","level":"INFO","level_value":20000}
>>>
>>> Thanks a lot for any help!
>>>
>>> Best,
>>> Eleanore
>>