Alright, here's another case where this is very pronounced. Here's a link to a couple of screenshots showing the overall stats for a slow task as well as a zoom in on the slowest of them: https://pasteboard.co/IxhGWXz.png
This is the sink stage of a pipeline with 3 upstream tasks. All the upstream subtasks complete their checkpoints end-to-end in under 10 seconds. Most of the sink subtasks also complete end-to-end in under a few seconds. There are a few that take a minute or so (which is also indicative of a problem) but then there is one that takes 29 minutes. The sink here is the StreamingFileSink. It does seem that each subtask that has a high end-to-end time also has a substantially higher alignment time but the end-to-end time is much larger than just alignment. I suppose the correct interpretation of this is that the end-to-end time alone indicates heavy backpressure / slow progress making on the slow subtasks and since they are moving so slowly that also explains how there could be a relatively high alignment time as well. The skew in the barrier arrival times is essentially amplified since the subtasks are making their way through the data so darn slowly. It's still very hard to understand how this sink could be taking so long to make progress. I mean unless I misunderstand something the total amount of data that has to be worked through to receive a barrier can't be more than what is buffered in Flink's network stack in the worst case, right? How could it take 29 minutes to consume this data in the sink? Anyway, I'd appreciate and feedback or insights. Thanks :) -Jamie On Fri, Sep 13, 2019 at 12:11 PM Jamie Grier <jgr...@lyft.com> wrote: > Thanks Seth and Stephan, > > Yup, I had intended to upload a image. Here it is: > https://pasteboard.co/Ixg0YP2.png > > This one is very simple and I suppose can be explained by heavy > backpressure. The more complex version of this problem I run into > frequently is where a single (or a couple of) sub-task(s) in a highly > parallel job takes up to an order of magnitude longer than others with > regard to end-to-end time and usually ends up causing checkpoint timeouts. > This often occurs in the sink task but that's not the only case I've seen. > > The only metric that shows a high value will still be end-to-end time. > Sync, async, and alignment times are all negligible. This, to me, is very > hard to understand, especially when this task will take 10 minutes+ to > complete and everything else takes seconds. > > Rather than speak hypothetically on this I'll post some data to this > thread as this situation occurs again. Maybe we can make sense of it > together. > > Thanks a lot for the help. > > -Jamie > > > . > > On Thu, Sep 12, 2019 at 10:57 AM Stephan Ewen <se...@apache.org> wrote: > >> Hi Jamie! >> >> Did you mean to attach a screenshot? If yes, you need to share that >> through >> a different channel, the mailing list does not support attachments, >> unfortunately. >> >> Seth is right how the time is measured. >> One important bit to add to the interpretation: >> - For non-source tasks, the time include the "travel of the barriers", >> which can take long under back pressure >> - For source tasks, it includes the "time to acquire the checkpoint >> lock", which can be long if the source is blocked in trying to emit data >> (again, backpressure). >> >> As part of FLIP-27 we will eliminate the checkpoint lock (have a mailbox >> instead) which should lead to faster lock acquisition. >> >> The "unaligned checkpoints" discussion is looking at ways to make >> checkpoints much less susceptible to back pressure. >> >> >> https://lists.apache.org/thread.html/fd5b6cceb4bffb635e26e7ec0787a8db454ddd64aadb40a0d08a90a8@%3Cdev.flink.apache.org%3E >> >> Hope that helps understanding what is going on. >> >> Best, >> Stephan >> >> >> On Thu, Sep 12, 2019 at 1:25 AM Seth Wiesman <sjwies...@gmail.com> wrote: >> >> > Great timing, I just debugged this on Monday. E2e time is checkpoint >> > coordinator to checkpoint coordinator, so it includes RPC to the source >> and >> > RPC from the operator back for the JM. >> > >> > Seth >> > >> > > On Sep 11, 2019, at 6:17 PM, Jamie Grier <jgr...@lyft.com.invalid> >> > wrote: >> > > >> > > Hey all, >> > > >> > > I need to make sense of this behavior. Any help would be appreciated. >> > > >> > > Here’s an example of a set of Flink checkpoint metrics I don’t >> > understand. This is the first operator in a job and as you can see the >> > end-to-end time for the checkpoint is long, but it’s not explained by >> > either sync, async, or alignment times. I’m not sure what to make of >> > this. It makes me think I don’t understand the meaning of the metrics >> > themselves. In my interpretation the end-to-end time should always be, >> > roughly, the sum of the other components — certainly in the case of a >> > source task such as this. >> > > >> > > Any thoughts or clarifications anyone can provide on this? We have >> many >> > jobs with slow checkpoints that suffer from this sort of thing with >> metrics >> > that look similar. >> > > >> > > -Jamie >> > > >> > >> >