Hello,

I am observing a failure whenever I trigger a savepoint on my Flink application, which otherwise runs without issues.
The app is deployed via AWS KDA (Kubernetes) with 256 KPU (6 task managers with 43 slots each; 1 KPU = 1 vCPU, 4 GB memory, and 50 GB disk space) and uses the RocksDB backend. The savepoint completes successfully on a larger cluster with 512 KPU. The savepoint size is about 150 GB, which should fit easily within the 256 KPU app as well.

I suspect that there is a resource leak somewhere, but the number of threads and the heap memory usage look normal (under 50%). How should I go about debugging the issue, and what other metrics should I be looking at? Note that the failure occurs only when a savepoint is triggered.

For the job graph and the full exception, see: https://stackoverflow.com/questions/68077200/flink-application-failure-on-savepoint

Thank you.

Best,
Abhishek
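
P.S. For reference, the capacity arithmetic behind "should fit easily" can be sketched as below. This is only a back-of-the-envelope sketch: it assumes the per-KPU figures quoted above scale linearly and that disk is split evenly across task managers, which may not match how KDA actually allocates resources. The function name and the returned keys are made up for illustration.

```python
# Per-KPU figures as quoted above: 4 GB memory and 50 GB disk per KPU.
MEMORY_GB_PER_KPU = 4
DISK_GB_PER_KPU = 50


def cluster_capacity(kpus, task_managers, savepoint_gb):
    """Rough aggregate capacity for a given KPU count (hypothetical helper).

    Assumes linear scaling per KPU and an even split of disk across
    task managers; both are assumptions, not documented KDA behavior.
    """
    total_memory_gb = kpus * MEMORY_GB_PER_KPU
    total_disk_gb = kpus * DISK_GB_PER_KPU
    return {
        "total_memory_gb": total_memory_gb,
        "total_disk_gb": total_disk_gb,
        "disk_gb_per_task_manager": total_disk_gb / task_managers,
        # Fraction of the aggregate disk that the savepoint would occupy.
        "savepoint_disk_fraction": savepoint_gb / total_disk_gb,
    }


if __name__ == "__main__":
    # 256 KPU, 6 task managers, ~150 GB savepoint (numbers from this post).
    for key, value in cluster_capacity(256, 6, 150).items():
        print(f"{key}: {value:.4g}")
```

By this estimate the savepoint is a small fraction of the aggregate disk, which is why raw capacity does not look like the limiting factor to me.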