org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 9353
expired before completing

I think I know why this happened in the first place. Our sink operator does
a synchronous HTTP POST, which had a 15-minute latency spike when this all
started. That could block Flink threads and prevent checkpoints from
completing in time. But I don't understand why checkpoints continued to fail
after the HTTP POST latency returned to normal. There seems to be some
lingering/cascading effect of previously failed checkpoints on future
checkpoints. Only after I redeployed/restarted the job an hour later did
checkpoints start to work again.

Would appreciate any suggestions/insights!

Thanks,
Steven
