Hello,

I am observing a failure whenever I trigger a savepoint on my Flink
application, which otherwise runs without issues.

The app is deployed via AWS KDA (Kubernetes) with 256 KPU (6 task managers
with 43 slots each; 1 KPU = 1 vCPU, 4 GB memory, and 50 GB disk space). It
uses the RocksDB state backend.

The savepoint completes successfully on a larger cluster with 512 KPU. The
savepoint size is about 150 GB, which should fit easily within the 256 KPU
app as well.
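
As a rough sanity check of that sizing, here is the arithmetic as a small
Python sketch. It assumes resources and savepoint data are spread evenly
across the 6 task managers, which I have not verified for KDA; the function
and field names are only illustrative.

```python
def sizing(kpu, task_managers=6, savepoint_gb=150,
           memory_gb_per_kpu=4, disk_gb_per_kpu=50):
    """Back-of-the-envelope capacity figures from the numbers quoted above."""
    total_memory_gb = kpu * memory_gb_per_kpu
    total_disk_gb = kpu * disk_gb_per_kpu
    return {
        "total_memory_gb": total_memory_gb,
        "total_disk_gb": total_disk_gb,
        # Assumption: even split across task managers (not verified).
        "disk_gb_per_tm": total_disk_gb / task_managers,
        "savepoint_gb_per_tm": savepoint_gb / task_managers,
        # Share of total provisioned disk the savepoint would occupy.
        "savepoint_share_of_disk": savepoint_gb / total_disk_gb,
    }


if __name__ == "__main__":
    for kpu in (256, 512):
        print(kpu, sizing(kpu))
```

By this estimate the savepoint is only a small fraction of the provisioned
disk at 256 KPU, which is why I do not think raw capacity is the problem.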

I suspect that there is a resource leak somewhere, but the number of threads
and the heap memory usage look normal (under 50%).

How should I go about debugging the issue, and what other metrics should I
be looking at?
Note that the failure occurs only when a savepoint is triggered.

For the job graph and the full exception, see:
https://stackoverflow.com/questions/68077200/flink-application-failure-on-savepoint


Thank you

Best,
Abhishek
