Hi,

We recently overhauled the backpressure detection [1]; a screenshot
preview of the result is attached here [2]. I encourage you to try the
1.13 RC0 build and see how the new mechanism works for you [3]. To support
those WebUI changes we have added three new metrics:
backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond.

1. I believe the new detection covers your point 1.
2. This still requires a bit of manual investigation. Once you locate the
backpressuring task, you can check the detailed subtask stats to see whether
all parallel instances are uniformly backpressured/busy or not. If you would
like to add a hint such as "it looks like you have a data skew in Task XYZ",
I believe that could be added to the WebUI.
3. The tricky part is how to display this kind of information. For now I
would recommend exporting the
backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond
metrics for every task to an external system and displaying them there, for
example in Grafana.
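
As a rough sketch of the check described in point 2, assuming you have
already collected one busyTimeMsPerSecond reading per subtask of the task in
question (how you fetch them is left out here), something like the following
could flag likely data skew. The function name and both thresholds are made
up for illustration; they are not Flink defaults or part of any Flink API.

```python
from statistics import median


def find_skewed_subtasks(busy_ms_per_subtask, ratio=2.0, min_busy_ms=500.0):
    """Return indices of subtasks that look much busier than their peers.

    busy_ms_per_subtask: one busyTimeMsPerSecond reading (0..1000) per subtask.
    ratio, min_busy_ms: hypothetical thresholds chosen for this example only.
    """
    if not busy_ms_per_subtask:
        return []
    # The median is a robust estimate of the "typical" subtask load.
    typical = median(busy_ms_per_subtask)
    skewed = []
    for index, busy_ms in enumerate(busy_ms_per_subtask):
        # Flag a subtask only if it is busy in absolute terms and also
        # clearly busier than the typical subtask of the same task.
        if busy_ms >= min_busy_ms and busy_ms > ratio * max(typical, 1.0):
            skewed.append(index)
    return skewed


# Subtask 2 is almost fully busy while its peers are mostly idle.
print(find_skewed_subtasks([120.0, 110.0, 950.0, 130.0]))
```

If all subtasks are uniformly busy, the list comes back empty, which
suggests the task is simply under-provisioned rather than skewed. The same
comparison can be applied to backPressuredTimeMsPerSecond.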

The blog post you are referencing is quite outdated, especially with those
new changes from 1.13. I'm hoping to write a new one pretty soon.

Piotrek

[1] https://issues.apache.org/jira/browse/FLINK-14712
[2]
https://issues.apache.org/jira/browse/FLINK-14814?focusedCommentId=17256926&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17256926
[3]
http://mail-archives.apache.org/mod_mbox/flink-user/202104.mbox/%3c1d2412ce-d4d0-ed50-6181-1b610e16d...@apache.org%3E

On Mon, 5 Apr 2021 at 23:20, Lu Niu <qqib...@gmail.com> wrote:

> Hi, Flink dev
>
> Lately, we want to develop some tools to:
> 1. show backpressure operator without manual operation
> 2. Provide suggestions to mitigate back pressure after checking data skew,
> external service RPC etc.
> 3. Show back pressure history
>
> Could anyone share their experience with such tooling?
> Also, I notice backpressure monitoring and detection is mentioned across
> multiple places. Could someone help to explain how these connect to each
> other? Maybe some of them are outdated? Thanks!
>
> 1. The official doc introduces monitoring back pressure through web UI.
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
> 2. In https://flink.apache.org/2019/07/23/flink-network-stack-2.html, it
> says outPoolUsage, inPoolUsage metrics can be used to determine back
> pressure.
> 3. The latest Flink version introduces a metric called "isBackPressured", but I
> didn't find related documentation on its usage.
>
> Best
> Lu
>
