Hi. Are there good ways to debug long Flink checkpoint durations? I'm running a backfill job that gets through ~10 days of data and then checkpoints start failing. Since the jobmaster UI only shows the last 10 checkpoints, I can't see when the problem starts.
I looked through the text logs and didn't see much. My guesses are: 1) I have something misconfigured that is causing old state to stick around, or 2) I don't have enough resources.
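To see when the durations start climbing, one option might be to poll the checkpoint statistics from the Flink REST API and keep my own history. Below is a minimal sketch, assuming the `/jobs/<job_id>/checkpoints` endpoint returns a `history` list whose entries carry `id`, `status`, `trigger_timestamp`, `end_to_end_duration`, and `state_size`. The base URL, job ID, and helper names are placeholders I made up, and the field names should be checked against the Flink version in use. (I believe the `web.checkpoints.history` setting also raises the number of checkpoints the UI keeps, but I have not verified that for my version.)

```python
import json
import urllib.request


def fetch_checkpoints(base_url, job_id):
    """Fetch raw checkpoint statistics from the Flink REST API.

    base_url and job_id are placeholders, e.g. "http://localhost:8081".
    """
    url = f"{base_url}/jobs/{job_id}/checkpoints"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def extract_rows(payload):
    """Pull the fields of interest out of the 'history' list, sorted by id.

    Assumes each history entry has the keys read below; missing keys
    come back as None instead of raising.
    """
    rows = []
    for cp in payload.get("history", []):
        rows.append({
            "id": cp.get("id"),
            "status": cp.get("status"),
            "trigger_timestamp": cp.get("trigger_timestamp"),
            "duration_ms": cp.get("end_to_end_duration"),
            "state_size": cp.get("state_size"),
        })
    return sorted(rows, key=lambda r: r["id"])


def merge_rows(seen, rows):
    """Merge newly polled rows into a dict keyed by checkpoint id.

    Later polls overwrite earlier ones, so an IN_PROGRESS checkpoint is
    replaced by its final COMPLETED or FAILED record.
    """
    for row in rows:
        seen[row["id"]] = row
    return seen
```

Calling `fetch_checkpoints` on a timer and feeding the result through `extract_rows` and `merge_rows` would build up a full record over the backfill. If `state_size` keeps growing the whole time, that would point toward guess 1; if it stays flat while `duration_ms` grows, that would point more toward guess 2.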