Hi Abhishek,

Does your job use checkpointing? It looks as if this is the first time the
respective checkpoint/savepoint thread pool is touched, and at that point
there are not enough file handles available.

Do you have a way to inspect the ulimits on the task managers?
If you don't have a way to change the limits, you could also try to reduce
the number of slots per task manager so that each Java process has more
handles available per slot.
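A minimal sketch (Linux) of how you could inspect the effective limit and
the current file-descriptor usage of a process via /proc. It uses the
current shell ($$) as a stand-in; on a task manager you would substitute
the JVM's PID, e.g. from `pgrep -f TaskManagerRunner` (the class name is
an assumption and may differ in your deployment):

```shell
# PID of the process to inspect; replace $$ with the TaskManager JVM's PID.
PID=$$

# Soft and hard "Max open files" limits as seen by that process.
grep "Max open files" /proc/"$PID"/limits

# Number of file descriptors the process currently has open.
ls /proc/"$PID"/fd | wc -l
```

If the open-descriptor count is close to the soft limit while a savepoint
is running, that would point at handle exhaustion rather than memory.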

I'm CCing some folks that have more insights into KDA and might be able to
give more specific pointers.

Best,

Arvid

On Thu, Jul 1, 2021 at 5:01 AM Abhishek SP <abhisheksp1...@gmail.com> wrote:

>
> Hello,
>
> I am observing a failure whenever I trigger a savepoint on my Flink
> application, which otherwise runs without issues.
>
> The app is deployed via AWS KDA (Kubernetes) with 256 KPUs (6 task
> managers with 43 slots each; 1 KPU = 1 vCPU, 4 GB memory, and 50 GB disk
> space). It uses the RocksDB state backend.
>
> The savepoint completes successfully with a larger cluster (512 KPUs).
> The savepoint size is about 150 GB, which should fit easily within the
> 256 KPU app as well.
>
> I suspect that there is a resource leak somewhere, but the number of
> threads and heap memory usage look normal (under 50%).
>
> How should I go about debugging this issue, and what other metrics should
> I be looking at?
> Note that the failure occurs only when a savepoint is triggered.
>
> For Job Graph and full exception:
> Ref:
> https://stackoverflow.com/questions/68077200/flink-application-failure-on-savepoint
>
>
> Thank you
>
> Best,
> Abhishek
>
