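Great timing, I just debugged this on Monday. The end-to-end (E2E) time is measured from checkpoint coordinator to checkpoint coordinator, so it includes the RPC to the source and the RPC from the operator back to the JM. That is why it can be longer than the sum of the sync, async, and alignment times.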
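
A rough sketch of that arithmetic, in case it helps when reading the metrics. The class and function names are hypothetical (this is not Flink API) and the numbers are made up for illustration. It only shows how to back out the part of the end-to-end time that the sync, async, and alignment times do not cover, which is where the RPC hops would show up:

```python
from dataclasses import dataclass


@dataclass
class SubtaskCheckpointStats:
    """Hypothetical container for one subtask's checkpoint timings (ms)."""
    end_to_end_ms: int
    sync_ms: int
    async_ms: int
    alignment_ms: int


def unaccounted_ms(stats: SubtaskCheckpointStats) -> int:
    """Return the end-to-end time not covered by sync + async + alignment."""
    accounted = stats.sync_ms + stats.async_ms + stats.alignment_ms
    return stats.end_to_end_ms - accounted


# Made-up values for a source subtask (alignment is 0 for a source).
example = SubtaskCheckpointStats(
    end_to_end_ms=30_000, sync_ms=5, async_ms=120, alignment_ms=0
)
gap = unaccounted_ms(example)
print(f"unaccounted: {gap} ms")
```

A large gap like this one points at time spent outside the task's own snapshot work rather than at slow state writes.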

Seth

> On Sep 11, 2019, at 6:17 PM, Jamie Grier <jgr...@lyft.com.invalid> wrote:
>
> Hey all,
>
> I need to make sense of this behavior. Any help would be appreciated.
>
> Here’s an example of a set of Flink checkpoint metrics I don’t understand.
> This is the first operator in a job and as you can see the end-to-end time
> for the checkpoint is long, but it’s not explained by either sync, async, or
> alignment times. I’m not sure what to make of this. It makes me think I
> don’t understand the meaning of the metrics themselves. In my interpretation
> the end-to-end time should always be, roughly, the sum of the other
> components, certainly in the case of a source task such as this.
>
> Any thoughts or clarifications anyone can provide on this? We have many jobs
> with slow checkpoints that suffer from this sort of thing with metrics that
> look similar.
>
> -Jamie