Here is the checkpoint config: no concurrent checkpoints, with a 2-minute checkpoint interval and a 2-minute timeout.
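In code, those settings would look roughly like the sketch below. It assumes the DataStream API; the class name is a placeholder and the job wiring is omitted, so this is an illustration of the settings rather than the actual job.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name; only the checkpoint settings matter here.
public class CheckpointSettingsSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        long twoMinutesMs = 2 * 60 * 1000L;

        // Trigger a checkpoint every 2 minutes.
        env.enableCheckpointing(twoMinutesMs);

        CheckpointConfig config = env.getCheckpointConfig();
        // Expire a checkpoint that has not completed within 2 minutes
        // (the default is 10 minutes).
        config.setCheckpointTimeout(twoMinutesMs);
        // No concurrent checkpoints: at most one in flight at a time.
        config.setMaxConcurrentCheckpoints(1);

        // Sources, operators, sinks, and env.execute(...) are omitted.
    }
}
```

With the interval equal to the timeout and a single checkpoint in flight, any stall in an operator longer than 2 minutes should be enough to expire the checkpoint that is running at that time.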
The problem is gone after redeployment. I will try to reproduce the issue.

[image: Inline image 1]

On Fri, Dec 1, 2017 at 6:17 AM, Nico Kruber <n...@data-artisans.com> wrote:

> Hi Steven,
> by default, checkpoints time out after 10 minutes if you haven't used
> CheckpointConfig#setCheckpointTimeout() to change this timeout.
>
> Depending on your checkpoint interval, and your number of concurrent
> checkpoints, there may already be some other checkpoint processes
> running while you are waiting for the first to finish. In that case,
> succeeding checkpoints may also fail with a timeout. However, they
> should definitely get back to normal once your sink has caught up with
> all buffered events.
>
> I included Stefan who may shed some more light onto it, but maybe you
> can help us identify the problem by providing logs at DEBUG level
> (did akka report any connection loss and gated actors? or maybe some
> other error in there?) or even a minimal program to reproduce.
>
>
> Nico
>
> On 01/12/17 07:36, Steven Wu wrote:
> >
> > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint
> > 9353 expired before completing
> >
> > I might know why this happened in the first place. Our sink operator
> > does a synchronous HTTP POST, which had a 15-minute latency spike when
> > this all started. This could block Flink threads and prevent checkpoints
> > from completing in time. But I don't understand why checkpoints continued
> > to fail after HTTP POST latency returned to normal. There seems to be some
> > lingering/cascading effect of previous failed checkpoints on future
> > checkpoints. Only after I redeployed/restarted the job an hour later
> > did checkpoints start to work again.
> >
> > Would appreciate any suggestions/insights!
> >
> > Thanks,
> > Steven
> >