org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 9353 expired before completing
I might know why this happened in the first place. Our sink operator does a synchronous HTTP POST, which had a 15-minute latency spike when this all started. This could block Flink threads and prevent checkpoints from completing in time. But I don't understand why checkpoints continued to fail after the HTTP POST latency returned to normal. There seems to be some lingering/cascading effect of previously failed checkpoints on future checkpoints. Only after I redeployed/restarted the job an hour later did checkpoints start to work again.

Would appreciate any suggestions/insights!

Thanks,
Steven
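
P.S. To make the suspected blocking effect concrete, here is a rough back-of-the-envelope sketch in plain Java. It is not the actual sink code and does not use any Flink API. The class name `BlockingSinkSketch`, the record count, and the latency values are all hypothetical; the only assumptions are that the sink handles records one at a time with a blocking POST, and that the checkpoint barrier is queued behind the records already buffered for that sink, so it cannot be processed until those records are done. The 10-minute timeout used below is Flink's default checkpoint timeout; the job's real setting may differ.

```java
public class BlockingSinkSketch {

    /**
     * Simulated time (ms) before a single sink thread reaches the checkpoint
     * barrier, assuming each buffered record ahead of the barrier costs one
     * blocking HTTP POST of postLatencyMillis.
     */
    static long barrierDelayMillis(int recordsAheadOfBarrier, long postLatencyMillis) {
        return recordsAheadOfBarrier * postLatencyMillis;
    }

    /** True if the barrier would be reached only after the checkpoint timeout. */
    static boolean checkpointExpires(int recordsAheadOfBarrier,
                                     long postLatencyMillis,
                                     long checkpointTimeoutMillis) {
        return barrierDelayMillis(recordsAheadOfBarrier, postLatencyMillis)
                > checkpointTimeoutMillis;
    }

    public static void main(String[] args) {
        long timeoutMillis = 10 * 60 * 1000L; // 10 minutes (hypothetical job setting)
        int buffered = 100;                   // hypothetical records queued ahead of the barrier

        // Normal latency (hypothetical 50 ms per POST): barrier reached after 5 s.
        System.out.println(checkpointExpires(buffered, 50L, timeoutMillis));

        // Spike (hypothetical 30 s per POST): barrier reached after 50 min.
        System.out.println(checkpointExpires(buffered, 30_000L, timeoutMillis));
    }
}
```

With these made-up numbers, the first case stays well within the timeout and the second does not, which would match the "expired before completing" message during the spike. It does not by itself explain why checkpoints kept failing afterwards.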