Re: Automatic backpressure detection

Lu Niu Tue, 13 Apr 2021 11:23:26 -0700

Cool. Thanks!

Best
Lu


On Mon, Apr 12, 2021 at 11:27 PM Piotr Nowojski <[email protected]>
wrote:

> Hi,
>
> Yes. Back-pressure from AsyncOperator should be correctly reported via
> the isBackPressured and backPressuredTimeMsPerSecond metrics and, by
> extension, in the WebUI from 1.13.
>
> Piotrek
>
> On Mon, 12 Apr 2021 at 23:17, Lu Niu <[email protected]> wrote:
>
> > Hi, Piotr
> >
> > Thanks for your detailed reply! The thread linked below mentions that
> > backpressure generated by AsyncOperator cannot be observed in the Flink
> > UI in 1.9.1. Is it fixed in the latest version? Thank you!
> >
> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Async-Function-Not-Generating-Backpressure-td26766.html
> >
> > Best
> > Lu
> >
> > On Tue, Apr 6, 2021 at 11:14 PM Piotr Nowojski <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > Yes, you can use `isBackPressured` to monitor a task's back-pressure.
> > > However, keep in mind:
> > > a) You are going to miss the nicer visualization of this information
> > > that is present in 1.13's WebUI.
> > > b) `isBackPressured` is a sampling-based metric. If your job has varying
> > > load, for example all windows firing at the same processing time every
> > > couple of seconds and causing intermittent back-pressure, this metric
> > > will show it randomly as `true` or `false`.
> > > c) `isBackPressured` is slightly less accurate than
> > > `backPressuredTimeMsPerSecond`. There are some corner cases where, for a
> > > brief amount of time, it can return `true` while a task is still
> > > running, whereas the time-based metrics work in a different, much more
> > > accurate way.
> > >
> > > About backporting the patches: if you want to create a custom Flink
> > > build, it should be doable. There will be some conflicts for sure, so
> > > you will need to understand Flink's code.
> > >
> > > Best,
> > > Piotrek
> > >
> > > On Wed, 7 Apr 2021 at 02:32, Lu Niu <[email protected]> wrote:
> > >
> > > > Hi, Piotr
> > > >
> > > > Thanks for replying!
> > > >
> > > > We don't have a plan to upgrade to 1.13 in the short term. We are
> > > > using Flink 1.11 and I noticed there is a metric called
> > > > isBackPressured. Is that enough to solve 1? If not, would backporting
> > > > the patches for backPressuredTimeMsPerSecond, busyTimeMsPerSecond and
> > > > idleTimeMsPerSecond work? And do you have an estimate of how difficult
> > > > that is?
> > > >
> > > >
> > > > Best
> > > > Lu
> > > >
> > > >
> > > >
> > > > On Tue, Apr 6, 2021 at 12:18 AM Piotr Nowojski <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Lately we overhauled the backpressure detection [1], and a
> > > > > screenshot preview of those efforts is attached here [2]. I
> > > > > encourage you to check the 1.13 RC0 build and see how the current
> > > > > mechanism works for you [3]. To support those WebUI changes we have
> > > > > added a couple of new metrics: backPressuredTimeMsPerSecond,
> > > > > busyTimeMsPerSecond and idleTimeMsPerSecond.
> > > > >
> > > > > 1. I believe that solves 1.
> > > > > 2. This still requires a bit of manual investigation. Once you
> > > > > locate the backpressuring task, you can check the detailed subtask
> > > > > stats to see whether all parallel instances are uniformly
> > > > > backpressured/busy or not. If you would like to add a hint such as
> > > > > "it looks like you have a data skew in Task XYZ", I believe that
> > > > > could be added to the WebUI.
> > > > > 3. The tricky part is how to display this kind of information.
> > > > > Currently I would recommend just exporting/reporting the
> > > > > backPressuredTimeMsPerSecond, busyTimeMsPerSecond and
> > > > > idleTimeMsPerSecond metrics for every task to an external system and
> > > > > displaying them, for example, in Grafana.
> > > > >
> > > > > The blog post you are referencing is quite outdated, especially
> > > > > with those new changes from 1.13. I'm hoping to write a new one
> > > > > pretty soon.
> > > > >
> > > > > Piotrek
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-14712
> > > > > [2] https://issues.apache.org/jira/browse/FLINK-14814?focusedCommentId=17256926&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17256926
> > > > > [3] http://mail-archives.apache.org/mod_mbox/flink-user/202104.mbox/%[email protected]%3E
> > > > >
> > > > > On Mon, 5 Apr 2021 at 23:20, Lu Niu <[email protected]> wrote:
> > > > >
> > > > > > Hi, Flink dev
> > > > > >
> > > > > > Lately, we want to develop some tools to:
> > > > > > 1. Show the backpressured operator without manual operation
> > > > > > 2. Provide suggestions to mitigate back pressure after checking
> > > > > > data skew, external service RPC, etc.
> > > > > > 3. Show back pressure history
> > > > > >
> > > > > > Could anyone share their experience with such tooling?
> > > > > > Also, I notice backpressure monitoring and detection is mentioned
> > > > > > across multiple places. Could someone help explain how these
> > > > > > connect to each other? Maybe some of them are outdated? Thanks!
> > > > > >
> > > > > > 1. The official doc introduces monitoring back pressure through
> > > > > > the web UI:
> > > > > > https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
> > > > > > 2. https://flink.apache.org/2019/07/23/flink-network-stack-2.html
> > > > > > says the outPoolUsage and inPoolUsage metrics can be used to
> > > > > > determine back pressure.
> > > > > > 3. The latest Flink version introduces a metric called
> > > > > > "isBackPressured", but I didn't find related documentation on its
> > > > > > usage.
> > > > > >
> > > > > > Best
> > > > > > Lu
> > > > > >
> > > > >
> > > >
> > >
> >
>
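
Editor's illustration (not part of the original messages): the manual check described in point 2 of the 6 Apr reply, whether all parallel instances of a task are uniformly backpressured/busy, could be sketched roughly as below. This is a minimal sketch under stated assumptions. It assumes you have already collected one `busyTimeMsPerSecond` (or `backPressuredTimeMsPerSecond`) sample per subtask by some means; the function names and both thresholds are hypothetical and are not Flink APIs or defaults.

```python
# Hypothetical thresholds, for illustration only (not Flink defaults).
HOT_MS = 500      # >= 500 ms of each second spent in this state counts as "hot"
SPREAD_MS = 400   # max - min gap at or above which subtasks are non-uniform


def summarize_subtasks(samples_ms, hot_ms=HOT_MS, spread_ms=SPREAD_MS):
    """Summarize one *TimeMsPerSecond sample (0..1000) per parallel subtask."""
    if not samples_ms:
        raise ValueError("need at least one subtask sample")
    # Indices of subtasks that spend most of each second in this state.
    hot = [i for i, ms in enumerate(samples_ms) if ms >= hot_ms]
    # A large gap between the busiest and idlest subtask suggests skew.
    spread = max(samples_ms) - min(samples_ms)
    return {"hot_subtasks": hot, "uniform": spread < spread_ms}


def skew_hint(task_name, busy_ms):
    """Return a hint string if only some subtasks are busy, else None."""
    summary = summarize_subtasks(busy_ms)
    if summary["hot_subtasks"] and not summary["uniform"]:
        return f"it looks like you have a data skew in {task_name}"
    return None


if __name__ == "__main__":
    # One subtask is saturated while the others are mostly idle.
    print(skew_hint("Task XYZ", [950, 100, 80, 120]))
    # All subtasks are similarly busy, so no skew hint is produced.
    print(skew_hint("Task XYZ", [900, 950, 920]))
```

A max-minus-min spread is used here only because it is easy to read; a percentile-based comparison may be more robust for tasks with high parallelism.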
