Cool feature +1

There is a subsection called monitoring in the operations section of the
docs. It would fit nicely there.

Seth

On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov <alexan...@ververica.com>
wrote:

> Hi Piotr,
>
> Thanks for the comments - all valid points.
> We should definitely document how the Flame Graphs are constructed - I
> will work on the docs. Do you have a suggestion as to which page/section
> they should become part of?
> I would also like to mention that I plan to work on further improvements,
> such as the ability to "zoom in" on the Flame Graphs for individual Tasks
> in an "unsquashed" form, so some of those concerns should be mitigated in
> the future.
>
> Best,
>
> --
>
> Alexander Fedulov | Solutions Architect
>
> <https://www.ververica.com/>
>
> Follow us @VervericaData
>
>
> On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
> > Nice feature, +1 from my side.
> >
> > In the PR I think we are missing documentation. I think it's especially
> > important to mention the limitations of this approach for performance
> > analysis. If we make it easy for the user to get this kind of data, it's
> > important that they do not proverbially shoot themselves in the foot with
> > false conclusions. We should clearly mention how the data is sampled and
> > point to limitations such as:
> > - data from all threads/operators are squashed together, so if there is a
> > data skew it will be averaged out
> > - stack sampling is/can be biased (JVM threads are more likely to be
> > stopped in some places than others, while skipping or rarely stopping in
> > the true hot spots), so one should treat the results with a grain of salt
> > below a certain level
> > - the true bottleneck might be obscured by the fact that flame graphs
> > squash results from all of the threads. There can be 60% of time spent in
> > one component according to a flame graph, but it might not necessarily be
> > the bottleneck, which could be a completely different component that has a
> > single thread burning 100% of a single CPU core, barely visible in the
> > flame graph at all
> >
> > It's great to have such a nice tool readily and easily available, but we
> > need to make sure people who are using it are aware when it can be
> > misleading.
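> >
> > To make the "squashing" point concrete, here is a rough, purely
> > illustrative sketch (not the actual Flink code; numSamples and
> > delayBetweenSamples are just made-up local parameters). Every thread
> > contributes one hit per frame per sample, regardless of how much CPU it
> > actually burns, so a single thread at 100% of one core gets diluted by
> > all the mostly idle ones once everything lands in a single count map:
> >
> >   import java.lang.management.ManagementFactory;
> >   import java.lang.management.ThreadInfo;
> >   import java.util.HashMap;
> >   import java.util.Map;
> >
> >   public class MergedSamplingSketch {
> >       public static void main(String[] args) throws InterruptedException {
> >           int numSamples = 100;           // hypothetical request parameters
> >           long delayBetweenSamples = 50;  // milliseconds
> >           Map<String, Integer> mergedFrameCounts = new HashMap<>();
> >           for (int i = 0; i < numSamples; i++) {
> >               for (ThreadInfo info : ManagementFactory
> >                       .getThreadMXBean().dumpAllThreads(false, false)) {
> >                   for (StackTraceElement frame : info.getStackTrace()) {
> >                       // one hit per frame per sample, for every thread alike
> >                       mergedFrameCounts.merge(frame.toString(), 1, Integer::sum);
> >                   }
> >               }
> >               Thread.sleep(delayBetweenSamples);
> >           }
> >           mergedFrameCounts.forEach((f, hits) -> System.out.println(hits + " " + f));
> >       }
> >   }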
> >
> > Piotrek
> >
> > On Tue, Mar 2, 2021 at 3:12 PM Till Rohrmann <trohrm...@apache.org> wrote:
> >
> > > Ah ok. Thanks for the clarification Alex.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <alexan...@ververica.com>
> > > wrote:
> > >
> > > > It is passed back as part of the response to the asynchronous callback
> > > > within the coordinator and is used to decide if all outstanding requests
> > > > to the parallel instances of a particular operator returned successfully.
> > > > If so, the request is considered successful, sub-results are combined,
> > > > and the thread info result future for the operator completes.
> > > >
> > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
> > > >
> > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
> > > >
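> > > >
> > > > As a rough sketch of that bookkeeping (hypothetical names, not the
> > > > actual classes from the PR; imports of java.util.* and CompletableFuture
> > > > omitted): the coordinator keeps each pending request in a map keyed by
> > > > the request id, counts down as the parallel instances respond, and
> > > > completes the per-operator future once the last response arrives:
> > > >
> > > >   // hypothetical sketch, not the PR code
> > > >   class PendingRequest {
> > > >       final CompletableFuture<List<ThreadInfo>> result = new CompletableFuture<>();
> > > >       final List<ThreadInfo> subResults = new ArrayList<>();
> > > >       int outstanding; // parallel instances that have not responded yet
> > > >       PendingRequest(int parallelism) { this.outstanding = parallelism; }
> > > >   }
> > > >
> > > >   private final Map<Integer, PendingRequest> pendingRequests = new HashMap<>();
> > > >
> > > >   // invoked from the asynchronous callback; the requestId travels with the response
> > > >   synchronized void handleSuccessfulResponse(int requestId, ThreadInfo sample) {
> > > >       PendingRequest pending = pendingRequests.get(requestId);
> > > >       if (pending == null) {
> > > >           return; // unknown, expired, or already completed request
> > > >       }
> > > >       pending.subResults.add(sample);
> > > >       if (--pending.outstanding == 0) {
> > > >           pendingRequests.remove(requestId);
> > > >           pending.result.complete(pending.subResults); // combine sub-results
> > > >       }
> > > >   }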
> > > >
> > > > Best,
> > > >
> > > > --
> > > >
> > > > Alexander Fedulov | Solutions Architect
> > > >
> > > > <https://www.ververica.com/>
> > > >
> > > > Follow us @VervericaData
> > > >
> > > >
> > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <trohrm...@apache.org>
> > > > wrote:
> > > >
> > > > > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples
> > > > > need to specify the request id? Is it because the caller can send a
> > > > > second request with the same id? Or can the caller query the result of
> > > > > a previous request by specifying the requestId?
> > > > >
> > > > > If the TaskExecutor does not need to know about the id, then we might
> > > > > be able to drop it.
> > > > >
> > > > > Cheers
> > > > > Till
> > > > >
> > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <alexan...@ververica.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Till,
> > > > > >
> > > > > > Thanks for your comments.
> > > > > >
> > > > > > * What is the requestId used for in the RPC call?
> > > > > >
> > > > > > It is the handle that is used as the key in the
> > > > > > ThreadInfoRequestCoordinator's pending responses Map. I believe it
> > > > > > was called sampleId in the StackTraceSampleCoordinator, but I decided
> > > > > > to rename it because there is also a ThreadInfoSampleService which is
> > > > > > actually responsible for sampling the JVM numSamples number of times.
> > > > > > I found that the notion of what a sample is was a bit confusing. Now
> > > > > > one thread info request corresponds to gathering numSamples from a
> > > > > > corresponding Task. Hope that makes sense.
> > > > > >
> > > > > > * Would it make sense to group numSubSamples, delayBetweenSamples
> > > > > > and maxStackTraceDepth into a ThreadSamplesRequest class? This would
> > > > > > decrease the number of parameters and group those which are closely
> > > > > > related together.
> > > > > >
> > > > > > Good point. I will rework it accordingly.
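> > > > > >
> > > > > > Just to illustrate the idea, something along these lines (a
> > > > > > hypothetical sketch with the field names from this thread, not the
> > > > > > final API):
> > > > > >
> > > > > >   public final class ThreadSamplesRequest implements java.io.Serializable {
> > > > > >       private final int numSubSamples;                       // how many times to sample the JVM
> > > > > >       private final java.time.Duration delayBetweenSamples;  // pause between two samples
> > > > > >       private final int maxStackTraceDepth;                  // truncate deeper stack traces
> > > > > >
> > > > > >       public ThreadSamplesRequest(
> > > > > >               int numSubSamples,
> > > > > >               java.time.Duration delayBetweenSamples,
> > > > > >               int maxStackTraceDepth) {
> > > > > >           this.numSubSamples = numSubSamples;
> > > > > >           this.delayBetweenSamples = delayBetweenSamples;
> > > > > >           this.maxStackTraceDepth = maxStackTraceDepth;
> > > > > >       }
> > > > > >
> > > > > >       public int getNumSubSamples() { return numSubSamples; }
> > > > > >       public java.time.Duration getDelayBetweenSamples() { return delayBetweenSamples; }
> > > > > >       public int getMaxStackTraceDepth() { return maxStackTraceDepth; }
> > > > > >   }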
> > > > > >
> > > > > > Best,
> > > > > > --
> > > > > >
> > > > > > Alexander Fedulov | Solutions Architect
> > > > > > Follow us @VervericaData
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > > > >
> > > > >
> > > >
> > >
> >
>
