Here's the second screenshot I forgot to include: https://pasteboard.co/IxhNIhc.png
On Fri, Sep 13, 2019 at 4:34 PM Jamie Grier <jgr...@lyft.com> wrote:

> Alright, here's another case where this is very pronounced. Here's a link
> to a couple of screenshots showing the overall stats for a slow task as
> well as a zoom-in on the slowest of them: https://pasteboard.co/IxhGWXz.png
>
> This is the sink stage of a pipeline with 3 upstream tasks. All the
> upstream subtasks complete their checkpoints end-to-end in under 10
> seconds. Most of the sink subtasks also complete end-to-end within a few
> seconds. There are a few that take a minute or so (which is also
> indicative of a problem), but then there is one that takes 29 minutes.
>
> The sink here is the StreamingFileSink.
>
> It does seem that each subtask with a high end-to-end time also has a
> substantially higher alignment time, but the end-to-end time is much
> larger than just the alignment.
>
> I suppose the correct interpretation is that the end-to-end time alone
> indicates heavy backpressure / slow progress on the slow subtasks, and
> since they are moving so slowly, that also explains how there could be a
> relatively high alignment time as well. The skew in the barrier arrival
> times is essentially amplified because the subtasks are making their way
> through the data so darn slowly.
>
> It's still very hard to understand how this sink could be taking so long
> to make progress. Unless I misunderstand something, the total amount of
> data that has to be worked through before receiving a barrier can't be
> more than what is buffered in Flink's network stack in the worst case,
> right? How could it take 29 minutes to consume this data in the sink?
>
> Anyway, I'd appreciate any feedback or insights.
>
> Thanks :)
>
> -Jamie
>
>
> On Fri, Sep 13, 2019 at 12:11 PM Jamie Grier <jgr...@lyft.com> wrote:
>
>> Thanks Seth and Stephan,
>>
>> Yup, I had intended to upload an image. Here it is:
>> https://pasteboard.co/Ixg0YP2.png
>>
>> This one is very simple and I suppose can be explained by heavy
>> backpressure. The more complex version of this problem I run into
>> frequently is where a single sub-task (or a couple of them) in a highly
>> parallel job takes up to an order of magnitude longer than the others in
>> end-to-end time and usually ends up causing checkpoint timeouts. This
>> often occurs in the sink task, but that's not the only case I've seen.
>>
>> The only metric that shows a high value is still the end-to-end time.
>> Sync, async, and alignment times are all negligible. This, to me, is very
>> hard to understand, especially when this task takes 10+ minutes to
>> complete and everything else takes seconds.
>>
>> Rather than speak hypothetically, I'll post some data to this thread
>> when this situation occurs again. Maybe we can make sense of it together.
>>
>> Thanks a lot for the help.
>>
>> -Jamie
>>
>>
>> On Thu, Sep 12, 2019 at 10:57 AM Stephan Ewen <se...@apache.org> wrote:
>>
>>> Hi Jamie!
>>>
>>> Did you mean to attach a screenshot? If yes, you need to share it
>>> through a different channel; the mailing list does not support
>>> attachments, unfortunately.
>>>
>>> Seth is right about how the time is measured.
>>> One important bit to add to the interpretation:
>>>   - For non-source tasks, the time includes the "travel of the
>>>     barriers", which can take a long time under backpressure.
>>>   - For source tasks, it includes the "time to acquire the checkpoint
>>>     lock", which can be long if the source is blocked trying to emit
>>>     data (again, backpressure).
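For reference, the checkpoint-timing settings this thread keeps coming back to (interval, timeout, concurrency) are set on CheckpointConfig in the DataStream API. A minimal sketch follows; the class name and the interval/timeout values are purely illustrative, not the settings of the job above.

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 5 minutes (illustrative value).
        env.enableCheckpointing(5 * 60 * 1000L);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // A checkpoint that does not complete end-to-end within this bound is
        // expired; under heavy backpressure the barrier travel time alone can
        // exceed a small timeout.
        checkpointConfig.setCheckpointTimeout(30 * 60 * 1000L);

        // Give the job time to make progress on regular data between attempts.
        checkpointConfig.setMinPauseBetweenCheckpoints(60 * 1000L);

        // Keep a single checkpoint in flight at a time.
        checkpointConfig.setMaxConcurrentCheckpoints(1);

        // ... define sources, transformations, and sinks here, then:
        // env.execute("checkpoint-timing-sketch");
    }
}

Raising the timeout only buys time for investigation; the skewed end-to-end times themselves still point at backpressure on the slow subtasks.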
>>> As part of FLIP-27 we will eliminate the checkpoint lock (and use a
>>> mailbox instead), which should remove that lock-acquisition delay.
>>>
>>> The "unaligned checkpoints" discussion is looking at ways to make
>>> checkpoints much less susceptible to backpressure.
>>>
>>> https://lists.apache.org/thread.html/fd5b6cceb4bffb635e26e7ec0787a8db454ddd64aadb40a0d08a90a8@%3Cdev.flink.apache.org%3E
>>>
>>> Hope that helps in understanding what is going on.
>>>
>>> Best,
>>> Stephan
>>>
>>>
>>> On Thu, Sep 12, 2019 at 1:25 AM Seth Wiesman <sjwies...@gmail.com> wrote:
>>>
>>> > Great timing, I just debugged this on Monday. E2E time is measured
>>> > checkpoint coordinator to checkpoint coordinator, so it includes the
>>> > RPC to the source and the RPC from the operator back to the JM.
>>> >
>>> > Seth
>>> >
>>> > > On Sep 11, 2019, at 6:17 PM, Jamie Grier <jgr...@lyft.com.invalid> wrote:
>>> > >
>>> > > Hey all,
>>> > >
>>> > > I need to make sense of this behavior. Any help would be appreciated.
>>> > >
>>> > > Here's an example of a set of Flink checkpoint metrics I don't
>>> > > understand. This is the first operator in a job, and as you can see
>>> > > the end-to-end time for the checkpoint is long, but it's not explained
>>> > > by the sync, async, or alignment times. I'm not sure what to make of
>>> > > this. It makes me think I don't understand the meaning of the metrics
>>> > > themselves. In my interpretation the end-to-end time should always be,
>>> > > roughly, the sum of the other components, certainly in the case of a
>>> > > source task such as this.
>>> > >
>>> > > Any thoughts or clarifications anyone can provide on this? We have
>>> > > many jobs with slow checkpoints that suffer from this sort of thing,
>>> > > with metrics that look similar.
>>> > >
>>> > > -Jamie
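A closing note on the unaligned-checkpoints discussion linked above: that work later shipped as an opt-in flag on CheckpointConfig (Flink 1.11 and later). A minimal sketch, assuming a Flink 1.11+ job; the class name and interval are illustrative only.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnalignedCheckpointsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative interval; tune for your job.
        env.enableCheckpointing(60_000L);

        // Unaligned checkpoints (Flink 1.11+): barriers can overtake buffered
        // records, so end-to-end checkpoint time no longer grows with the
        // amount of backpressured data sitting in the network stack.
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        // ... define sources, transformations, and sinks here, then:
        // env.execute("unaligned-checkpoints-sketch");
    }
}

The trade-off is that the in-flight buffers become part of the snapshot, so checkpoints get larger in exchange for barriers no longer waiting behind backpressured data.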