Re: Checkpoint metrics.

2019-10-02 Thread Stephan Ewen
Hi Jamie! (and adding Klou) I think the Streaming File Sink has a limit on the number of concurrent uploads. Could it be that too many uploads enqueue and at some point, the checkpoint blocks for a long time until that queue is worked off? Klou, do you have more insights here? Best, Stephan On

Re: Checkpoint metrics.

2019-09-14 Thread Jamie Grier
Thanks Konstantin, Refining this a little bit.. All the checkpoints for all the subtasks upstream of the sink complete in seconds. Most of the subtasks of the sink itself also complete in seconds other than these very few "slow" ones. So, somehow we are taking at worst 29 minutes to clear the d

Re: Checkpoint metrics.

2019-09-14 Thread Konstantin Knauf
Hi Jamie, I think your interpretation is correct. It takes a long time until the first barrier reaches the "slow" subtask and, in the case of the screenshot, another 3m 22s until the last barrier reaches the subtask. Regarding the total amount of data: depending on your checkpoint configuration (es

Re: Checkpoint metrics.

2019-09-13 Thread Jamie Grier
Alright, here's another case where this is very pronounced. Here's a link to a couple of screenshots showing the overall stats for a slow task as well as a zoom in on the slowest of them: https://pasteboard.co/IxhGWXz.png This is the sink stage of a pipeline with 3 upstream tasks. All the upstr

Re: Checkpoint metrics.

2019-09-13 Thread Jamie Grier
Here's the second screenshot I forgot to include: https://pasteboard.co/IxhNIhc.png On Fri, Sep 13, 2019 at 4:34 PM Jamie Grier wrote: > Alright, here's another case where this is very pronounced. Here's a link > to a couple of screenshots showing the overall stats for a slow task as > well as

Re: Checkpoint metrics.

2019-09-13 Thread Jamie Grier
Thanks Seth and Stephan, Yup, I had intended to upload an image. Here it is: https://pasteboard.co/Ixg0YP2.png This one is very simple and I suppose can be explained by heavy backpressure. The more complex version of this problem I run into frequently is where a single (or a couple of) sub-task(

Re: Checkpoint metrics.

2019-09-12 Thread Stephan Ewen
Hi Jamie! Did you mean to attach a screenshot? If yes, you need to share that through a different channel, the mailing list does not support attachments, unfortunately. Seth is right how the time is measured. One important bit to add to the interpretation: - For non-source tasks, the time inclu

Re: Checkpoint metrics.

2019-09-11 Thread Seth Wiesman
Great timing, I just debugged this on Monday. E2e time is checkpoint coordinator to checkpoint coordinator, so it includes RPC to the source and RPC from the operator back to the JM. Seth > On Sep 11, 2019, at 6:17 PM, Jamie Grier wrote: > > Hey all, > > I need to make sense of this behav