Thanks Seth and Stephan,

Yup, I had intended to upload a image.  Here it is:
https://pasteboard.co/Ixg0YP2.png

This one is very simple and I suppose can be explained by heavy
backpressure.  The more complex version of this problem I run into
frequently is where a single (or a couple of) sub-task(s) in a highly
parallel job takes up to an order of magnitude longer than others with
regard to end-to-end time and usually ends up causing checkpoint timeouts.
This often occurs in the sink task but that's not the only case I've seen.

The only metric that shows a high value will still be end-to-end time.
Sync, async, and alignment times are all negligible.  This, to me, is very
hard to understand, especially when this task will take 10 minutes+ to
complete and everything else takes seconds.

Rather than speak hypothetically on this I'll post some data to this thread
as this situation occurs again.  Maybe we can make sense of it together.

Thanks a lot for the help.

-Jamie


.

On Thu, Sep 12, 2019 at 10:57 AM Stephan Ewen <se...@apache.org> wrote:

> Hi Jamie!
>
> Did you mean to attach a screenshot? If yes, you need to share that through
> a different channel, the mailing list does not support attachments,
> unfortunately.
>
> Seth is right how the time is measured.
> One important bit to add to the interpretation:
>   - For non-source tasks, the time include the "travel of the barriers",
> which can take long under back pressure
>   - For source tasks, it includes the "time to acquire the checkpoint
> lock", which can be long if the source is blocked in trying to emit data
> (again, backpressure).
>
> As part of FLIP-27 we will eliminate the checkpoint lock (have a mailbox
> instead) which should lead to faster lock acquisition.
>
> The "unaligned checkpoints" discussion is looking at ways to make
> checkpoints much less susceptible to back pressure.
>
>
> https://lists.apache.org/thread.html/fd5b6cceb4bffb635e26e7ec0787a8db454ddd64aadb40a0d08a90a8@%3Cdev.flink.apache.org%3E
>
> Hope that helps understanding what is going on.
>
> Best,
> Stephan
>
>
> On Thu, Sep 12, 2019 at 1:25 AM Seth Wiesman <sjwies...@gmail.com> wrote:
>
> > Great timing, I just debugged this on Monday. E2e time is checkpoint
> > coordinator to checkpoint coordinator, so it includes RPC to the source
> and
> > RPC from the operator back for the JM.
> >
> > Seth
> >
> > > On Sep 11, 2019, at 6:17 PM, Jamie Grier <jgr...@lyft.com.invalid>
> > wrote:
> > >
> > > Hey all,
> > >
> > > I need to make sense of this behavior.  Any help would be appreciated.
> > >
> > > Here’s an example of a set of Flink checkpoint metrics I don’t
> > understand.  This is the first operator in a job and as you can see the
> > end-to-end time for the checkpoint is long, but it’s not explained by
> > either sync, async, or alignment times.  I’m not sure what to make of
> > this.  It makes me think I don’t understand the meaning of the metrics
> > themselves.  In my interpretation the end-to-end time should always be,
> > roughly, the sum of the other components — certainly in the case of a
> > source task such as this.
> > >
> > > Any thoughts or clarifications anyone can provide on this?  We have
> many
> > jobs with slow checkpoints that suffer from this sort of thing with
> metrics
> > that look similar.
> > >
> > > -Jamie
> > >
> >
>

Reply via email to