Hi,

Have you tried using a more recent Flink version? 1.8.x is no longer supported, and the latest versions might not have this issue anymore.
Secondly, have you tried backtracking those references to the Finalizers, assuming that Finalizer is indeed the class causing the problems? It may also well be a non-Flink issue [1]. I've put a small diagnostic sketch below the quoted mail.

Best regards,
Piotrek

[1] https://issues.apache.org/jira/browse/KAFKA-8546

On Thu, Sep 3, 2020 at 04:47 Josson Paul <jossonp...@gmail.com> wrote:

> Hi All,
>
> *ISSUE*
> ------
> The Flink application runs for some time, then suddenly the CPU shoots up
> to its peak, pod memory reaches its peak, the GC count increases, and the
> old-gen space fills to close to 100%. A full GC does not free up heap
> space. At that point I stopped sending data and cancelled the Flink jobs,
> but the old-gen space still did not come down. I took a heap dump and can
> see a lot of objects of the java.lang.ref.Finalizer class. I have attached
> the details in a Word document. I do have the heap dump, but it is close
> to 2 GB compressed. Is it safe to upload it somewhere and share it here?
>
> This issue does not happen with Flink 1.4.0 and Beam release-2.4.0.
>
> *WORKING CLUSTER INFO* (Flink: 1.4.0 and Beam: release-2.4.0)
> ----------------------------------------------------
>
> The application reads from Kafka, does aggregations over 5-minute windows,
> and writes back to Kafka. It uses Beam constructs to build the pipeline
> and Beam connectors to read and write.
>
> Flink version: 1.4.0
> Beam version: release-2.4.0
> Backend state: state backend is on the heap, with checkpointing to a
> distributed file system.
>
> Number of task managers: 1
> Heap: 6.4 GB
> CPU: 4 cores
> Deployment: standalone cluster on a Kubernetes pod
>
> *NOT WORKING CLUSTER INFO* (Flink version: 1.8.3 and Beam version:
> release-2.15.0)
> ----------
> Application details are the same as above.
>
> *No change in the application or in the rate at which data is injected;
> only the Flink and Beam versions changed.*
>
> Flink version: 1.8.3
> Beam version: release-2.15.0
> Backend state: state backend is on the heap, with checkpointing to a
> distributed file system.
>
> Number of task managers: 1
> Heap: 6.5 GB
> CPU: 4 cores
> Deployment: standalone cluster on a Kubernetes pod
>
> My observations
> -------------
>
> 1) The CPU flame graphs show that in the working version, the CPU time
> spent on GC is lower than in the non-working version (please see the
> attached flame graphs: *CPU-flame-WORKING.svg* for the working cluster and
> *CPU-flame-NOT-working.svg* for the non-working one).
>
> 2) I have attached the flame graph of native-memory MALLOC calls taken
> while the issue was happening (*malloc-NOT-working.svg*). The pod memory
> peaks when this issue happens; it looks to me like the GC process is
> requesting a lot of native memory.
>
> 3) While the issue is happening, the GC CPU usage is very high. Please see
> the flame graph (*CPU-graph-at-issuetime.svg*).
>
> Note: the SVG files can be opened in any browser and are clickable once
> opened.
>
> --
> Thanks
> Josson
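P.S. Here is a minimal sketch of what I mean by confirming the Finalizer suspicion from inside the JVM. The class name FinalizerQueueProbe is made up; the only API it relies on is the standard java.lang.management MemoryMXBean. If you start it from anywhere inside the TaskManager JVM (for example from a DoFn's @Setup method), it logs how many objects are waiting for finalization. If that number keeps growing while the old gen fills up, the finalizer thread cannot keep up and the heap is most likely being held through java.lang.ref.Finalizer references.

import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Flink or Beam: starts a daemon thread that
// periodically logs the number of objects waiting for finalization in this JVM.
public final class FinalizerQueueProbe {

    public static void start(long intervalMillis) {
        Thread probe = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                int pending = ManagementFactory.getMemoryMXBean()
                        .getObjectPendingFinalizationCount();
                System.out.println("objects pending finalization: " + pending);
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "finalizer-queue-probe");
        probe.setDaemon(true);
        probe.start();
    }
}

You can get the same number without any code via JMX as well: the java.lang:type=Memory MBean exposes an ObjectPendingFinalizationCount attribute.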