Here's the second screenshot I forgot to include: https://pasteboard.co/IxhNIhc.png
On Fri, Sep 13, 2019 at 4:34 PM Jamie Grier <jgr...@lyft.com> wrote:

> Alright, here's another case where this is very pronounced. Here's a link
> to a couple of screenshots showing the overall stats for a slow task as
> well as a zoom-in on the slowest of them: https://pasteboard.co/IxhGWXz.png
>
> This is the sink stage of a pipeline with 3 upstream tasks. All the
> upstream subtasks complete their checkpoints end-to-end in under 10
> seconds. Most of the sink subtasks also complete end-to-end within a few
> seconds. There are a few that take a minute or so (which is also
> indicative of a problem), but then there is one that takes 29 minutes.
>
> The sink here is the StreamingFileSink.
>
> It does seem that each subtask with a high end-to-end time also has a
> substantially higher alignment time, but the end-to-end time is much
> larger than just the alignment.
>
> I suppose the correct interpretation is that the end-to-end time alone
> indicates heavy backpressure / slow progress on the slow subtasks, and
> since they are moving so slowly, that also explains how there could be a
> relatively high alignment time as well. The skew in the barrier arrival
> times is essentially amplified because the subtasks are making their way
> through the data so darn slowly.
>
> It's still very hard to understand how this sink could be taking so long
> to make progress. Unless I misunderstand something, the total amount of
> data that has to be worked through before receiving a barrier can't be
> more than what is buffered in Flink's network stack in the worst case,
> right? How could it take 29 minutes to consume this data in the sink?
>
> Anyway, I'd appreciate any feedback or insights.
>
> Thanks :)
>
> -Jamie
>
>
> On Fri, Sep 13, 2019 at 12:11 PM Jamie Grier <jgr...@lyft.com> wrote:
>
>> Thanks Seth and Stephan,
>>
>> Yup, I had intended to upload an image. Here it is:
>> https://pasteboard.co/Ixg0YP2.png
>>
>> This one is very simple and I suppose can be explained by heavy
>> backpressure. The more complex version of this problem I run into
>> frequently is where a single sub-task (or a couple of them) in a highly
>> parallel job takes up to an order of magnitude longer than the others in
>> end-to-end time and usually ends up causing checkpoint timeouts. This
>> often occurs in the sink task, but that's not the only case I've seen.
>>
>> The only metric that shows a high value is still the end-to-end time.
>> Sync, async, and alignment times are all negligible. This, to me, is very
>> hard to understand, especially when this task takes 10+ minutes to
>> complete and everything else takes seconds.
>>
>> Rather than speak hypothetically, I'll post some data to this thread
>> when this situation occurs again. Maybe we can make sense of it together.
>>
>> Thanks a lot for the help.
>>
>> -Jamie
>>
>>
>> On Thu, Sep 12, 2019 at 10:57 AM Stephan Ewen <se...@apache.org> wrote:
>>
>>> Hi Jamie!
>>>
>>> Did you mean to attach a screenshot? If yes, you need to share it
>>> through a different channel; the mailing list does not support
>>> attachments, unfortunately.
>>>
>>> Seth is right about how the time is measured.
>>> One important bit to add to the interpretation:
>>>   - For non-source tasks, the time includes the "travel of the
>>>     barriers", which can take a long time under backpressure.
>>>   - For source tasks, it includes the "time to acquire the checkpoint
>>>     lock", which can be long if the source is blocked trying to emit
>>>     data (again, backpressure).
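For reference, the checkpoint-timing settings this thread keeps coming back to (interval, timeout, concurrency) are set on CheckpointConfig in the DataStream API. A minimal sketch follows; the class name and the interval/timeout values are purely illustrative, not the settings of the job above.

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 5 minutes (illustrative value).
        env.enableCheckpointing(5 * 60 * 1000L);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // A checkpoint that does not complete end-to-end within this bound is
        // expired; under heavy backpressure the barrier travel time alone can
        // exceed a small timeout.
        checkpointConfig.setCheckpointTimeout(30 * 60 * 1000L);

        // Give the job time to make progress on regular data between attempts.
        checkpointConfig.setMinPauseBetweenCheckpoints(60 * 1000L);

        // Keep a single checkpoint in flight at a time.
        checkpointConfig.setMaxConcurrentCheckpoints(1);

        // ... define sources, transformations, and sinks here, then:
        // env.execute("checkpoint-timing-sketch");
    }
}

Raising the timeout only buys time for investigation; the skewed end-to-end times themselves still point at backpressure on the slow subtasks.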
>>> As part of FLIP-27 we will eliminate the checkpoint lock (and use a
>>> mailbox instead), which should remove that lock-acquisition delay.
>>>
>>> The "unaligned checkpoints" discussion is looking at ways to make
>>> checkpoints much less susceptible to backpressure.
>>>
>>> https://lists.apache.org/thread.html/fd5b6cceb4bffb635e26e7ec0787a8db454ddd64aadb40a0d08a90a8@%3Cdev.flink.apache.org%3E
>>>
>>> Hope that helps in understanding what is going on.
>>>
>>> Best,
>>> Stephan
>>>
>>>
>>> On Thu, Sep 12, 2019 at 1:25 AM Seth Wiesman <sjwies...@gmail.com> wrote:
>>>
>>> > Great timing, I just debugged this on Monday. E2E time is measured
>>> > checkpoint coordinator to checkpoint coordinator, so it includes the
>>> > RPC to the source and the RPC from the operator back to the JM.
>>> >
>>> > Seth
>>> >
>>> > > On Sep 11, 2019, at 6:17 PM, Jamie Grier <jgr...@lyft.com.invalid> wrote:
>>> > >
>>> > > Hey all,
>>> > >
>>> > > I need to make sense of this behavior. Any help would be appreciated.
>>> > >
>>> > > Here's an example of a set of Flink checkpoint metrics I don't
>>> > > understand. This is the first operator in a job, and as you can see
>>> > > the end-to-end time for the checkpoint is long, but it's not explained
>>> > > by the sync, async, or alignment times. I'm not sure what to make of
>>> > > this. It makes me think I don't understand the meaning of the metrics
>>> > > themselves. In my interpretation the end-to-end time should always be,
>>> > > roughly, the sum of the other components, certainly in the case of a
>>> > > source task such as this.
>>> > >
>>> > > Any thoughts or clarifications anyone can provide on this? We have
>>> > > many jobs with slow checkpoints that suffer from this sort of thing,
>>> > > with metrics that look similar.
>>> > >
>>> > > -Jamie
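A closing note on the unaligned-checkpoints discussion linked above: that work later shipped as an opt-in flag on CheckpointConfig (Flink 1.11 and later). A minimal sketch, assuming a Flink 1.11+ job; the class name and interval are illustrative only.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnalignedCheckpointsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative interval; tune for your job.
        env.enableCheckpointing(60_000L);

        // Unaligned checkpoints (Flink 1.11+): barriers can overtake buffered
        // records, so end-to-end checkpoint time no longer grows with the
        // amount of backpressured data sitting in the network stack.
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        // ... define sources, transformations, and sinks here, then:
        // env.execute("unaligned-checkpoints-sketch");
    }
}

The trade-off is that the in-flight buffers become part of the snapshot, so checkpoints get larger in exchange for barriers no longer waiting behind backpressured data.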