Hi all,

I am currently running a Flink job cluster (v1.8.2) on Kubernetes with 4 pods, each pod with parallelism 4. The job reads from a source topic with 96 partitions and applies a per-element filter; the filter criteria come from a broadcast topic, and the job always uses the latest broadcast message as the criteria. The filtered records are then published to a sink topic. There is no checkpointing or state involved.
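To make the setup concrete, the pipeline is roughly along the lines of the sketch below. This is only a simplified illustration: the class name, topic names, bootstrap servers, the String record type, and the contains() predicate are all placeholders, and I am assuming the latest criteria is held in a plain field on the co-flat-map function rather than in managed state.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.util.Collector;

public class BroadcastFilterJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "kafka:9092"); // placeholder

        // Main stream: source topic with 96 partitions.
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("source-topic", new SimpleStringSchema(), kafkaProps));

        // Broadcast the criteria topic so every parallel filter instance sees every update.
        DataStream<String> criteria = env.addSource(
                new FlinkKafkaConsumer<>("broadcast-topic", new SimpleStringSchema(), kafkaProps))
                .broadcast();

        events.connect(criteria)
                .flatMap(new CoFlatMapFunction<String, String, String>() {
                    // Latest filter criteria; a plain field, not checkpointed state.
                    private String latestCriteria;

                    @Override
                    public void flatMap1(String event, Collector<String> out) {
                        // Placeholder predicate: forward the event if it matches the latest criteria.
                        if (latestCriteria == null || event.contains(latestCriteria)) {
                            out.collect(event);
                        }
                    }

                    @Override
                    public void flatMap2(String newCriteria, Collector<String> out) {
                        // Always keep only the newest broadcast message.
                        latestCriteria = newCriteria;
                    }
                })
                .addSink(new FlinkKafkaProducer<>("sink-topic", new SimpleStringSchema(), kafkaProps));

        env.execute("broadcast-filter-job");
    }
}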
I am continuously seeing "GC overhead limit exceeded" errors and the pods keep restarting, so I tried to increase the task manager heap size and enable heap dumps via the container args:

    containers:
      - args:
          - task-manager
          - -Djobmanager.rpc.address=service-job-manager
          - -Dtaskmanager.heap.size=4096m
          - -Denv.java.opts.taskmanager="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin"

Three things I noticed:

1. The task manager heap size is not shown correctly in the web UI [image: image.png].
2. There is no heap dump file at /dumps/oom.bin in the restarted pod. Did I set the Java opts wrong?
3. I continuously see the log message below from all pods and am not sure whether it causes any issue:

{"@timestamp":"2020-04-29T23:39:43.387Z","@version":"1","message":"[Consumer clientId=consumer-1, groupId=aba774bc] Node 6 was unable to process the fetch request with (sessionId=2054451921, epoch=474): FETCH_SESSION_ID_NOT_FOUND.","logger_name":"org.apache.kafka.clients.FetchSessionHandler","thread_name":"pool-6-thread-1","level":"INFO","level_value":20000}

Thanks a lot for any help!

Best,
Eleanore