Stephan, thanks a lot for the explanation. Now everything makes sense to
me. Will set the min pause.

On Sat, Dec 2, 2017 at 8:58 AM, Stephan Ewen <se...@apache.org> wrote:

> Hi Steven!
>
> You are right, there could be some cascading effect from previous
> checkpoints.
> I think the best way to handle that is to set the "minimum pause between
> checkpoints". In fact, I would actually recommend this over the checkpoint
> interval parameter.
>
> The pause lets the job work off whatever backlog built up during an
> unhealthy checkpoint. You can, for example, set the checkpoint interval to
> 2 minutes and the pause to 1.5 minutes. That way, if a checkpoint takes
> longer than usual, the next one will still wait 1.5 minutes after the
> previous one completed or expired, giving the job time to catch up.
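>
> A minimal sketch of that setup (assuming the DataStream API; the interval
> and pause values just mirror the example numbers above):
>
>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
>     StreamExecutionEnvironment env =
>         StreamExecutionEnvironment.getExecutionEnvironment();
>     // Trigger a checkpoint every 2 minutes ...
>     env.enableCheckpointing(120_000L);
>     // ... but always leave at least 1.5 minutes between the end of one
>     // checkpoint and the start of the next.
>     env.getCheckpointConfig().setMinPauseBetweenCheckpoints(90_000L);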
>
> Best,
> Stephan
>
>
> On Fri, Dec 1, 2017 at 10:10 PM, Steven Wu <stevenz...@gmail.com> wrote:
>
>> Here is the checkpoint config: no concurrent checkpoints, with a
>> 2-minute checkpoint interval and timeout.
>>
>> The problem is gone after redeployment. I will try to reproduce the
>> issue.
>>
>> [image: screenshot of the checkpoint configuration]
>>
>> On Fri, Dec 1, 2017 at 6:17 AM, Nico Kruber <n...@data-artisans.com>
>> wrote:
>>
>>> Hi Steven,
>>> by default, checkpoints time out after 10 minutes if you haven't used
>>> CheckpointConfig#setCheckpointTimeout() to change this timeout.
>>>
>>> Depending on your checkpoint interval, and your number of concurrent
>>> checkpoints, there may already be some other checkpoint processes
>>> running while you are waiting for the first to finish. In that case,
>>> succeeding checkpoints may also fail with a timeout. However, they
>>> should definitely get back to normal once your sink has caught up with
>>> all buffered events.
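>>>
>>> As a minimal sketch of these knobs (assuming env is a
>>> StreamExecutionEnvironment with checkpointing enabled; the timeout shown
>>> is just the 10-minute default):
>>>
>>>     import org.apache.flink.streaming.api.environment.CheckpointConfig;
>>>
>>>     CheckpointConfig config = env.getCheckpointConfig();
>>>     // Expire a checkpoint if it has not completed within 10 minutes.
>>>     config.setCheckpointTimeout(600_000L);
>>>     // Never run more than one checkpoint at a time, so a slow
>>>     // checkpoint cannot have others piling up behind it.
>>>     config.setMaxConcurrentCheckpoints(1);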
>>>
>>> I have included Stefan, who may shed some more light on it, but maybe
>>> you can help us identify the problem by providing logs at DEBUG level
>>> (did Akka report any connection loss or gated actors? or maybe some
>>> other error in there?) or even a minimal program that reproduces it.
>>>
>>>
>>> Nico
>>>
>>> On 01/12/17 07:36, Steven Wu wrote:
>>> >
>>> > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint
>>> > 9353 expired before completing
>>> >
>>> > I might know why this happened in the first place. Our sink operator
>>> > does a synchronous HTTP post, which had a 15-min latency spike when
>>> > this all started. That could block Flink task threads and prevent
>>> > checkpoints from completing in time. But I don't understand why
>>> > checkpoints continued to fail after the HTTP post latency returned to
>>> > normal. There seems to be some lingering/cascading effect of previous
>>> > failed checkpoints on later ones. Only after I redeployed/restarted
>>> > the job an hour later did checkpointing start to work again.
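>>> >
>>> > For illustration, the blocking pattern looks roughly like this
>>> > (hypothetical endpoint and class name, not our actual sink):
>>> >
>>> >     import java.io.OutputStream;
>>> >     import java.net.HttpURLConnection;
>>> >     import java.net.URL;
>>> >     import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
>>> >
>>> >     public class SyncHttpSink extends RichSinkFunction<String> {
>>> >         @Override
>>> >         public void invoke(String value) throws Exception {
>>> >             // The task thread blocks here until the POST completes, so
>>> >             // a latency spike on the endpoint also delays the checkpoint
>>> >             // barriers flowing through this operator.
>>> >             HttpURLConnection conn = (HttpURLConnection)
>>> >                 new URL("http://example.com/events").openConnection();
>>> >             conn.setDoOutput(true);
>>> >             conn.setRequestMethod("POST");
>>> >             try (OutputStream out = conn.getOutputStream()) {
>>> >                 out.write(value.getBytes("UTF-8"));
>>> >             }
>>> >             conn.getResponseCode(); // blocks until the server responds
>>> >             conn.disconnect();
>>> >         }
>>> >     }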
>>> >
>>> > Would appreciate any suggestions/insights!
>>> >
>>> > Thanks,
>>> > Steven
>>>
>>>
>>
>
