Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

Till Rohrmann Fri, 05 Mar 2021 00:29:51 -0800

Thanks Alex. From my perspective we could continue with the vote now.

Cheers,
Till


On Thu, Mar 4, 2021 at 9:15 PM Alexander Fedulov <alexan...@ververica.com>
wrote:

> @Till, I've added the proposed ThreadInfoSamplesRequest and updated the
> FLIP and the PR accordingly.
>
> Best,
>
> --
>
> Alexander Fedulov | Solutions Architect
>
> <https://www.ververica.com/>
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
>
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
> Wehner
>
>
>
> On Wed, Mar 3, 2021 at 5:03 PM Alexander Fedulov <alexan...@ververica.com>
> wrote:
>
> > Added docs to the PR.
> > @David, thanks for the tip, it seems like a good place to put them.
> >
> > --
> >
> > Alexander Fedulov | Solutions Architect
> >
> > <https://www.ververica.com/>
> >
> > Follow us @VervericaData
> >
> >
> >
> >
> > On Wed, Mar 3, 2021 at 12:10 PM David Anderson <dander...@apache.org>
> > wrote:
> >
> >> This is going to make performance analysis and optimization much more
> >> accessible. I can't wait to include this in our training courses.
> >>
> >> +1
> >>
> >> Seth suggested putting the docs for this feature under
> >> Operations/Monitoring, but there's already a page in the docs under
> >> Operations/Debugging for Application Profiling & Debugging, which is
> more
> >> on point. I think it will be confusing if the flame graphs aren't
> >> together with the other profilers.
> >>
> >> David
> >>
> >> On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <sjwies...@gmail.com>
> wrote:
> >>
> >> > Cool feature +1
> >> >
> >> > There is a subsection called monitoring in the operations section of
> the
> >> > docs. It would fit nicely there.
> >> >
> >> > Seth
> >> >
> >> > On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov <
> >> alexan...@ververica.com>
> >> > wrote:
> >> >
> >> > > Hi Piotr,
> >> > >
> >> > > Thanks for the comments - all valid points.
> >> > > We should definitely document how the Flame Graphs are constructed
> - I
> >> > will
> >> > > work on the docs. Do you have a proposition about the part of which
> >> > > page/section they should become?
> >> > > I would like to also mention here that I plan to work on further
> >> > > improvements, such as the ability to "zoom in" into the Flame Graphs
> >> for
> >> > > the individual Tasks in the "unsquashed" form,  so some of those
> >> concerns
> >> > > should be mitigated in the future.
> >> > >
> >> > > Best,
> >> > >
> >> > > --
> >> > >
> >> > > Alexander Fedulov | Solutions Architect
> >> > >
> >> > > <https://www.ververica.com/>
> >> > >
> >> > > Follow us @VervericaData
> >> > >
> >> > >
> >> > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <pnowoj...@apache.org
> >
> >> > > wrote:
> >> > >
> >> > > > Nice feature +1 from my side for it.
> >> > > >
> >> > > > In the PR I think we are missing documentation. I think it's
> >> especially
> >> > > > important to mention the limitations of this approach for
> >> performance
> >> > > > analysis. If we make it easy for the user to get such kind of
> data,
> >> > it's
> >> > > > important they do not proverbially shoot themselves in their own
> >> foot
> >> > > with
> >> > > > false conclusions. We should clearly mention how those data are
> >> > sampled,
> >> > > > and point to limitations such as:
> >> > > > - data from all threads/operators are squashed together, so if
> there
> >> > is a
> >> > > > data skew it will be averaged out
> >> > > > - stack sampling is/can be biased (JVM threads are more likely to
> be
> >> > > > stopped in some places than others, while skipping/rarely stopping
> >> in
> >> > the
> >> > > > true hot spots - so one should treat the results with a grain of
> >> salt
> >> > > below
> >> > > > a certain level)
> >> > > > - true bottleneck might be obscured by the fact flame graphs are
> >> > > squashing
> >> > > > results from all of the threads. There can be 60% of time spent in
> >> one
> >> > > > component according to a flame graph, but it might not necessarily
> >> be
> >> > the
> >> > > > bottleneck, which could be in a completely different component
> >> running
> >> > > > which has a single thread burning 100% of a single CPU core,
> barely
> >> > > visible
> >> > > > in the flame graph at all.
> >> > > >
> >> > > > It's great to have such a nice tool readily and easily available,
> >> but
> >> > we
> >> > > > need to make sure people who are using it are aware when it can be
> >> > > > misleading.
> >> > > >
> >> > > > Piotrek
> >> > > >
> >> > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <trohrm...@apache.org>
> >> > napisał(a):
> >> > > >
> >> > > > > Ah ok. Thanks for the clarification Alex.
> >> > > > >
> >> > > > > Cheers,
> >> > > > > Till
> >> > > > >
> >> > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <
> >> > > > alexan...@ververica.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > It is passed back as part of the response to the asynchronous
> >> > > callback
> >> > > > > > within the coordinator and is used to decide if all
> outstanding
> >> > > > requests
> >> > > > > to
> >> > > > > > the parallel instances of a particular operator returned
> >> > > successfully.
> >> > > > If
> >> > > > > > so, the request is considered successful, sub-results are
> >> combined
> >> > > and
> >> > > > > the
> >> > > > > > thread info result future for an operator completes.
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
> >> > > > > >
> >> > > > > > Best,
> >> > > > > >
> >> > > > > > --
> >> > > > > >
> >> > > > > > Alexander Fedulov | Solutions Architect
> >> > > > > >
> >> > > > > > <https://www.ververica.com/>
> >> > > > > >
> >> > > > > > Follow us @VervericaData
> >> > > > > >
> >> > > > > >
> >> > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <
> >> > trohrm...@apache.org>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Why does the caller of
> >> > TaskExecutorGateway.requestThreadInfoSamples
> >> > > > > need
> >> > > > > > to
> >> > > > > > > specify the request id? Is it because the caller can send a
> >> > second
> >> > > > > > request
> >> > > > > > > with the same id? Or can the caller query the result of a
> >> > previous
> >> > > > > > request
> >> > > > > > > by specifying the requestId?
> >> > > > > > >
> >> > > > > > > If the TaskExecutor does not need to know about the id, then
> >> we
> >> > > might
> >> > > > > be
> >> > > > > > > able to drop it.
> >> > > > > > >
> >> > > > > > > Cheers
> >> > > > > > > Till
> >> > > > > > >
> >> > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <
> >> > > > > > alexan...@ververica.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi Till,
> >> > > > > > > >
> >> > > > > > > > Thanks for your comments.
> >> > > > > > > >
> >> > > > > > > > * What is the requestId used for in the RPC call?
> >> > > > > > > >
> >> > > > > > > > It is the handle that is used as the key in the
> >> > > > > > > > ThreadInfoRequestCoordinator's pending responses Map. I
> >> believe
> >> > > it
> >> > > > > was
> >> > > > > > > > called sampleId in the StackTraceSampleCoordinator, but I
> >> > decided
> >> > > > to
> >> > > > > > > rename
> >> > > > > > > > it because there is also a ThreadInfoSampleService which
> is
> >> > > > actually
> >> > > > > > > > responsible for sampling the JVM numSamples number of
> >> times. I
> >> > > > found
> >> > > > > > that
> >> > > > > > > > the notion of what a sample is was a bit confusing. Now
> one
> >> > > thread
> >> > > > > info
> >> > > > > > > > request corresponds to gathering numSamples from a
> >> > corresponding
> >> > > > > Task.
> >> > > > > > > Hope
> >> > > > > > > > that makes sense.
> >> > > > > > > >
> >> > > > > > > > * Would it make sense to group numSubSamples,
> >> > delayBetweenSamples
> >> > > > and
> >> > > > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This
> >> > would
> >> > > > > > decrease
> >> > > > > > > > the number of parameters and group those which are closely
> >> > > related
> >> > > > > > > > together.
> >> > > > > > > >
> >> > > > > > > > Good point. I will rework it accordingly.
> >> > > > > > > >
> >> > > > > > > > Best,
> >> > > > > > > > --
> >> > > > > > > >
> >> > > > > > > > Alexander Fedulov | Solutions Architect
> >> > > > > > > > Follow us @VervericaData
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Sent from:
> >> > > > > > >
> >> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

Reply via email to