Here is the checkpoint config: no concurrent checkpoints, with a 2-minute
checkpoint interval and a 2-minute timeout.
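
For reference, settings like these would look roughly as follows in code.
This is only a sketch assuming the Java DataStream API; the class name is
made up for illustration, and the rest of the job (sources, sink,
env.execute) is omitted.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name; only the checkpoint settings matter here.
public class CheckpointSettingsSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        long twoMinutesMs = 2 * 60 * 1000L;

        // Trigger a checkpoint every 2 minutes.
        env.enableCheckpointing(twoMinutesMs);

        CheckpointConfig config = env.getCheckpointConfig();
        // Expire a checkpoint that has not completed within 2 minutes
        // (the default timeout is 10 minutes).
        config.setCheckpointTimeout(twoMinutesMs);
        // Allow only one checkpoint in flight at a time.
        config.setMaxConcurrentCheckpoints(1);

        // Sources, operators, sinks, and env.execute(...) are omitted.
    }
}
```

With the timeout equal to the interval, a checkpoint that is still running
when the next one is due will expire at about that moment.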

The problem is gone after redeployment. I will see whether I can reproduce the issue.

[image: Inline image 1]

On Fri, Dec 1, 2017 at 6:17 AM, Nico Kruber <n...@data-artisans.com> wrote:

> Hi Steven,
> by default, checkpoints time out after 10 minutes if you haven't used
> CheckpointConfig#setCheckpointTimeout() to change this timeout.
>
> Depending on your checkpoint interval, and your number of concurrent
> checkpoints, there may already be some other checkpoint processes
> running while you are waiting for the first to finish. In that case,
> subsequent checkpoints may also fail with a timeout. However, they
> should definitely get back to normal once your sink has caught up with
> all buffered events.
>
> I included Stefan, who may shed some more light on it, but maybe you
> can help us identify the problem by providing logs at DEBUG level
> (did Akka report any connection loss and gated actors? or maybe some
> other error in there?) or even a minimal program to reproduce it.
>
>
> Nico
>
> On 01/12/17 07:36, Steven Wu wrote:
> >
> > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint
> > 9353 expired before completing
> >
> > I might know why this happened in the first place. Our sink operator
> > does a synchronous HTTP POST, which had a 15-minute latency spike when this
> > all started. This could block Flink threads and prevent checkpoints from
> > completing in time. But I don't understand why checkpoints continued to
> > fail after HTTP POST latency returned to normal. There seems to be some
> > lingering/cascading effect of previous failed checkpoints on future
> > checkpoints. Only after I redeployed/restarted the job an hour later
> > did checkpoints start to work again.
> >
> > Would appreciate any suggestions/insights!
> >
> > Thanks,
> > Steven
>
>
