Indeed, it is a bit tricky to understand the relation between floatingBuffersUsage and exclusiveBuffersUsage. I am reading that table on https://flink.apache.org/2019/07/23/flink-network-stack-2.html again, but I guess I can rely on the latency metric that I implemented on my operator (not the default latency tracking from Flink).
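For reference, what I implemented is roughly the sketch below: a minimal version assuming the flink-metrics-dropwizard dependency for percentile support, where MyEvent and its timestamp getter are just placeholder names.

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
    import org.apache.flink.metrics.Histogram;

    public class LatencyHistogramMap extends RichMapFunction<MyEvent, MyEvent> {

        private transient Histogram latencyHistogram;

        @Override
        public void open(Configuration parameters) {
            // Wrap a Dropwizard histogram so that percentiles (e.g. the 99th)
            // are exposed through Flink's metric reporters.
            latencyHistogram = getRuntimeContext()
                    .getMetricGroup()
                    .histogram("eventLatency", new DropwizardHistogramWrapper(
                            new com.codahale.metrics.Histogram(
                                    new com.codahale.metrics.SlidingWindowReservoir(500))));
        }

        @Override
        public MyEvent map(MyEvent event) {
            // Latency observed at this operator: current processing time minus
            // the timestamp the event carries from its creation (placeholder getter).
            latencyHistogram.update(System.currentTimeMillis() - event.getCreationTimestamp());
            return event;
        }
    }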
Thanks for the insightful points!
Felipe

--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez
-- https://felipeogutierrez.blogspot.com

On Sat, Mar 7, 2020 at 4:36 PM Zhijiang <wangzhijiang...@aliyun.com> wrote:
>
> Thanks for the feedback Felipe!
> Regarding your concern below:
>
> > Although I think it is better to use outPoolUsage and inPoolUsage according
> > to [1]. However, in your opinion is it better (faster to see) to use
> > inputQueueLength and outputQueueLength or outPoolUsage and inPoolUsage
> > to monitor a consequence of backpressure? I mean, is there a faster way
> > to show that the latency increased due to backpressure? Maybe if I
> > create my own metric on my own operator or UDF?
>
> The blog post [1] already gives a great explanation of the network stack for
> users in general, and I share its view on this issue.
> In particular, I can provide some further notes for your understanding.
>
> 1. It is not easy for users to get the precise total number of input & output
> buffers, so from the inputQueueLength and outputQueueLength metrics we cannot
> tell whether the input & output buffers are exhausted and backpressure has
> happened. In contrast, we can easily see that inputPoolUsage and
> outputPoolUsage both reach 100% once backpressure happens.
>
> 2. inputPoolUsage has different semantics from release 1.9 onward. Before 1.9
> this metric only measured the usage of floating buffers, but from 1.9 it also
> covers the usage of exclusive buffers. That means from 1.9 you might see
> inputPoolUsage far from 100% when backpressure happens, especially in the
> data-skew case, while inputFloatingBufferUsage should be at 100% instead.
>
> 3. The latency marker provided by the Flink framework is emitted to a random
> channel (non-broadcast) every time, for performance reasons. So over a short
> period it is hard to say whether it is measuring a heavily loaded channel or
> a lightweight one, especially in a data-skew scenario.
>
> 4. In theory the latency should increase along with increasing
> input/outputQueueLength and input/outputPoolUsage. All of them should be
> proportional and follow the same trend in most cases.
>
> Best,
> Zhijiang
>
> ------------------------------------------------------------------
> From: Felipe Gutierrez <felipe.o.gutier...@gmail.com>
> Send Time: 2020 Mar. 7 (Sat.) 18:49
> To: Arvid Heise <ar...@ververica.com>
> Cc: Zhijiang <wangzhijiang...@aliyun.com>; user <user@flink.apache.org>
> Subject: Re: Backpressure and 99th percentile latency
>
> Hi,
> I implemented my own histogram metric on my operator to measure the latency.
> The latency now follows the throughput at the same pace.
> The figures are attached.
>
> Best,
> Felipe
>
> --
> -- Felipe Gutierrez
> -- skype: felipe.o.gutierrez
> -- https://felipeogutierrez.blogspot.com
>
> On Fri, Mar 6, 2020 at 9:38 AM Felipe Gutierrez
> <felipe.o.gutier...@gmail.com> wrote:
> >
> > Thanks for the clarifying answer @Zhijiang. I am going to monitor
> > inputQueueLength and outputQueueLength to check their relation with
> > backpressure, although I think it is better to use outPoolUsage and
> > inPoolUsage according to [1].
> > However, in your opinion is it better (faster to see) to use
> > inputQueueLength and outputQueueLength or outPoolUsage and inPoolUsage
> > to monitor a consequence of backpressure? I mean, is there a faster way
> > to show that the latency increased due to backpressure? Maybe if I
> > create my own metric on my own operator or UDF?
> >
> > Thanks @Arvid.
> > In the end I want to be able to hold SLAs. For me, the SLA would be the
> > minimum latency. If I understood correctly, from the moment backpressure
> > starts, the latency-tracking metrics are not a very precise indication of
> > how much backpressure my application is suffering; they just indicate
> > that there is backpressure.
> > What would you say is a more or less precise metric to tune the
> > throughput so as to avoid backpressure? Something like: if I have
> > 50,000 milliseconds of latency and the normal latency is 150
> > milliseconds, the throughput has to decrease by a factor of
> > 50,000/150 (about 333).
> >
> > Just a note: I am not changing the throughput of the sources yet. I am
> > changing the size of the window without restarting the job. But I guess
> > that has the same meaning for this question.
> >
> > [1] https://flink.apache.org/2019/07/23/flink-network-stack-2.html
> >
> > --
> > -- Felipe Gutierrez
> > -- skype: felipe.o.gutierrez
> > -- https://felipeogutierrez.blogspot.com
> >
> > On Fri, Mar 6, 2020 at 8:17 AM Arvid Heise <ar...@ververica.com> wrote:
> > >
> > > Hi Felipe,
> > >
> > > latency under backpressure has to be interpreted carefully. Latency's
> > > semantics actually require that the data source is read in a timely
> > > manner; that is, there is no bottleneck in your pipeline where data is
> > > piling up.
> > >
> > > Thus, to measure latency in experiments you must ensure that the current
> > > throughput is below the maximum throughput, for example by gradually
> > > increasing the throughput with a generating source or through some
> > > throttling of the external source. Until you reach the maximum
> > > throughput, latency's semantics are exactly what you expect. Everything
> > > after that is more or less just the reciprocal of backpressure.
> > >
> > > If you step away from the theoretical considerations and look at actual
> > > setups, you can easily see why this distinction makes sense: if you have
> > > a low-latency application, you are doomed if you have backpressure
> > > (you cannot hold SLAs). You would immediately rescale if you saw signs
> > > of backpressure (or even earlier). Then latency always has the desired
> > > semantics.
> > >
> > > On Fri, Mar 6, 2020 at 5:55 AM Zhijiang <wangzhijiang...@aliyun.com>
> > > wrote:
> > >>
> > >> Hi Felipe,
> > >>
> > >> Let me try to answer your questions below.
> > >>
> > >> > I understand that I am tracking latency every 10 seconds for each
> > >> > physical operator instance. Is that right?
> > >>
> > >> Generally right. The latency marker is emitted from the source and
> > >> flows through all the intermediate operators until the sink. This
> > >> interval controls the emitting frequency at the source.
> > >>
> > >> > The backpressure goes away but the 99th percentile latency is still
> > >> > the same. Why? Do they have no relation to each other?
> > >>
> > >> The latency might be influenced by the buffer flush timeout, network
> > >> transport and load, etc. In the case of backpressure, a huge amount of
> > >> in-flight data accumulates on the network wire, so the latency marker
> > >> queues up waiting for network transport, which can bring noticeable
> > >> delay. The latency marker may not even be emitted from the source in
> > >> time because no buffers are temporarily available.
> > >>
> > >> After the backpressure goes away, that does not mean there are no
> > >> accumulated buffers on the network wire, only that they no longer
> > >> reach the degree of backpressure. So the latency marker still needs
> > >> to queue behind the accumulated buffers on the wire, and it might take
> > >> some time to fully digest the previously accumulated buffers before
> > >> the latency relaxes. I guess this might be your case. You can monitor
> > >> the metrics "inputQueueLength" and "outputQueueLength" to confirm the
> > >> status. Anyway, the answer is yes, it is related to backpressure, but
> > >> there might be some delay before you see the changes clearly.
> > >>
> > >> > In the end I left the experiment running for more than 2 hours and
> > >> > only after about 1.5 hours did the 99th percentile latency go down
> > >> > to milliseconds. Is that normal?
> > >>
> > >> I guess it is normal, as mentioned above. Once there are no
> > >> accumulated buffers left in the network stack and no backpressure at
> > >> all, it should go down to milliseconds.
> > >>
> > >> Best,
> > >> Zhijiang
> > >>
> > >> ------------------------------------------------------------------
> > >> From: Felipe Gutierrez <felipe.o.gutier...@gmail.com>
> > >> Send Time: 2020 Mar. 6 (Fri.) 05:04
> > >> To: user <user@flink.apache.org>
> > >> Subject: Backpressure and 99th percentile latency
> > >>
> > >> Hi,
> > >>
> > >> I am a bit confused about the topic of tracking latency in Flink [1].
> > >> It says that if I use latency tracking I am measuring Flink's network
> > >> stack, but application code latencies can also influence it. For
> > >> instance, say I am using metrics.latency.granularity: operator (the
> > >> default) and setLatencyTrackingInterval(10000). I understand that I am
> > >> then tracking latency every 10 seconds for each physical operator
> > >> instance. Is that right?
> > >>
> > >> In my application, I am tracking the latency of all aggregators. When
> > >> I have a high workload and can see backpressure in the Flink UI, the
> > >> 99th percentile latency is 13, 25, 21, and 25 seconds. Then I set my
> > >> aggregator to have a larger window. The backpressure goes away but the
> > >> 99th percentile latency is still the same. Why? Do they have no
> > >> relation to each other?
> > >>
> > >> In the end I left the experiment running for more than 2 hours and
> > >> only after about 1.5 hours did the 99th percentile latency go down to
> > >> milliseconds. Is that normal? Please see the attached figure.
> > >>
> > >> [1]
> > >> https://flink.apache.org/2019/07/23/flink-network-stack-2.html#latency-tracking
> > >>
> > >> Thanks,
> > >> Felipe
> > >> --
> > >> -- Felipe Gutierrez
> > >> -- skype: felipe.o.gutierrez
> > >> -- https://felipeogutierrez.blogspot.com
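For reference, the latency-tracking setup described in the original question corresponds roughly to the sketch below (the interval is in milliseconds; metrics.latency.granularity lives in flink-conf.yaml, not in code):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class LatencyTrackingSetup {
        public static void main(String[] args) {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Emit a LatencyMarker from every source every 10 seconds;
            // a value <= 0 disables latency tracking.
            env.getConfig().setLatencyTrackingInterval(10000);

            // In flink-conf.yaml (not in code):
            //   metrics.latency.granularity: operator
            // "operator" (the default) keeps one latency histogram per
            // source/operator combination.
        }
    }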