Hi Felipe,

I will try to answer your questions below.

> I understand that I am tracking latency every 10 seconds for each physical 
> instance operator. Is that right?

Generally right. Latency markers are emitted by the sources and flow through 
all intermediate operators until they reach the sinks. The interval controls 
how often each source emits a marker.

> The backpressure goes away but the 99th percentile latency is still the same. 
> Why? Does it have no relation with each other?

Latency can be influenced by the buffer flush timeout, network transport, 
load, and so on. Under backpressure, a large amount of in-flight data 
accumulates in the network stack, so a latency marker has to queue behind it 
while waiting for transport, which can add a noticeable delay. The marker may 
not even be emitted from the source on time if no buffers are temporarily 
available.

The backpressure going away does not mean there are no accumulated buffers 
left in the network stack, only that they no longer reach the level that 
triggers backpressure. The latency marker still has to queue behind those 
buffers, and it can take some time to fully drain the previously accumulated 
data before the latency relaxes. I guess this is your case. You can monitor 
the "inputQueueLength" and "outputQueueLength" metrics to confirm it. So the 
answer is yes, latency is related to backpressure, but there can be a delay 
before the change becomes visible.
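
To make the drain-time argument concrete, here is a rough back-of-the-envelope 
model in Python. It is not Flink code: the function name is hypothetical and 
the rates and backlog size are made-up numbers, not measurements from your job. 
It treats the network buffers as one FIFO queue, so a marker entering now waits 
for the whole backlog ahead of it, and the backlog shrinks only by the spare 
capacity (processing rate minus input rate).

```python
def marker_latency_over_time(backlog, in_rate, out_rate, step_s, steps):
    """Return (time_s, marker_latency_s) samples for a single FIFO queue.

    backlog  : records already queued when backpressure ends (assumed value)
    in_rate  : records/s still arriving from upstream
    out_rate : records/s the downstream operator can process
    """
    samples = []
    for i in range(steps + 1):
        t = i * step_s
        # A marker enqueued at time t waits behind the whole backlog.
        samples.append((t, backlog / out_rate))
        # The backlog shrinks only by the spare capacity (out_rate - in_rate).
        backlog = max(0.0, backlog + (in_rate - out_rate) * step_s)
    return samples


# Hypothetical numbers: 20 s worth of queued records, 0.5% spare capacity.
for t, latency in marker_latency_over_time(
    backlog=200_000, in_rate=9_950, out_rate=10_000, step_s=1_800, steps=3
):
    print(f"t={t:>5}s  marker latency ~ {latency:.1f}s")
```

With these assumed numbers the marker latency starts at 20 s and only reaches 
zero after more than an hour, even though the operator is never overloaded 
during that time. A job running close to its capacity can behave the same way.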

> In the end I left the experiment for more than 2 hours running and only after 
> about 1,5 hour the 99th percentile latency got down to milliseconds. Is that 
> normal?

I think that is normal, for the reason given above. Once the backpressure is 
gone and the accumulated buffers in the network stack have been fully drained, 
the latency should drop to milliseconds.

Best,
Zhijiang
------------------------------------------------------------------
From:Felipe Gutierrez <felipe.o.gutier...@gmail.com>
Send Time:2020 Mar. 6 (Fri.) 05:04
To:user <user@flink.apache.org>
Subject:Backpressure and 99th percentile latency

Hi,

I am a bit confused about the topic of tracking latency in Flink [1]. It says 
if I use the latency track I am measuring the Flink's network stack but 
application code latencies also can influence it. For instance, if I am using 
the metrics.latency.granularity: operator (default) and 
setLatencyTrackingInterval(10000). I understand that I am tracking latency 
every 10 seconds for each physical instance operator. Is that right?

In my application, I am tracking the latency of all aggregators. When I have a 
high workload and I can see backpressure from the flink UI the 99th percentile 
latency is 13, 25, 21, and 25 seconds. Then I set my aggregator to have a 
larger window. The backpressure goes away but the 99th percentile latency is 
still the same. Why? Does it have no relation with each other?

In the end I left the experiment for more than 2 hours running and only after 
about 1,5 hour the 99th percentile latency got down to milliseconds. Is that 
normal? Please see the figure attached.

[1] 
https://flink.apache.org/2019/07/23/flink-network-stack-2.html#latency-tracking

Thanks,
Felipe
--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez

-- https://felipeogutierrez.blogspot.com
