Added docs to the PR. @David, thanks for the tip, it seems like a good place to put them.
-- Alexander Fedulov | Solutions Architect <https://www.ververica.com/> Follow us @VervericaData On Wed, Mar 3, 2021 at 12:10 PM David Anderson <dander...@apache.org> wrote: > This is going to make performance analysis and optimization much more > accessible. I can't wait to include this in our training courses. > > +1 > > Seth suggested putting the docs for this feature under > Operations/Monitoring, but there's already a page in the docs under > Operations/Debugging for Application Profiling & Debugging, which is more > on point. I think it will be confusing if the flame graphs aren't > together with the other profilers. > > David > > On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <sjwies...@gmail.com> wrote: > > > Cool feature +1 > > > > There is a subsection called monitoring in the operations section of the > > docs. It would fit nicely there. > > > > Seth > > > > On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov < > alexan...@ververica.com> > > wrote: > > > > > Hi Piotr, > > > > > > Thanks for the comments - all valid points. > > > We should definitely document how the Flame Graphs are constructed - I > > will > > > work on the docs. Do you have a proposition about the part of which > > > page/section they should become? > > > I would like to also mention here that I plan to work on further > > > improvements, such as the ability to "zoom in" into the Flame Graphs > for > > > the individual Tasks in the "unsquashed" form, so some of those > concerns > > > should be mitigated in the future. > > > > > > Best, > > > > > > -- > > > > > > Alexander Fedulov | Solutions Architect > > > > > > <https://www.ververica.com/> > > > > > > Follow us @VervericaData > > > > > > > > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <pnowoj...@apache.org> > > > wrote: > > > > > > > Nice feature +1 from my side for it. > > > > > > > > In the PR I think we are missing documentation. I think it's > especially > > > > important to mention the limitations of this approach for performance > > > > analysis. If we make it easy for the user to get such kind of data, > > it's > > > > important they do not proverbially shoot themselves in their own foot > > > with > > > > false conclusions. We should clearly mention how those data are > > sampled, > > > > and point to limitations such as: > > > > - data from all threads/operators are squashed together, so if there > > is a > > > > data skew it will be averaged out > > > > - stack sampling is/can be biased (JVM threads are more likely to be > > > > stopped in some places than others, while skipping/rarely stopping in > > the > > > > true hot spots - so one should treat the results with a grain of salt > > > below > > > > a certain level) > > > > - true bottleneck might be obscured by the fact flame graphs are > > > squashing > > > > results from all of the threads. There can be 60% of time spent in > one > > > > component according to a flame graph, but it might not necessarily be > > the > > > > bottleneck, which could be in a completely different component > running > > > > which has a single thread burning 100% of a single CPU core, barely > > > visible > > > > in the flame graph at all. > > > > > > > > It's great to have such a nice tool readily and easily available, but > > we > > > > need to make sure people who are using it are aware when it can be > > > > misleading. > > > > > > > > Piotrek > > > > > > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <trohrm...@apache.org> > > napisaĆ(a): > > > > > > > > > Ah ok. Thanks for the clarification Alex. > > > > > > > > > > Cheers, > > > > > Till > > > > > > > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov < > > > > alexan...@ververica.com> > > > > > wrote: > > > > > > > > > > > It is passed back as part of the response to the asynchronous > > > callback > > > > > > within the coordinator and is used to decide if all outstanding > > > > requests > > > > > to > > > > > > the parallel instances of a particular operator returned > > > successfully. > > > > If > > > > > > so, the request is considered successful, sub-results are > combined > > > and > > > > > the > > > > > > thread info result future for an operator completes. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277 > > > > > > > > > > > > Best, > > > > > > > > > > > > -- > > > > > > > > > > > > Alexander Fedulov | Solutions Architect > > > > > > > > > > > > <https://www.ververica.com/> > > > > > > > > > > > > Follow us @VervericaData > > > > > > > > > > > > > > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann < > > trohrm...@apache.org> > > > > > > wrote: > > > > > > > > > > > > > Why does the caller of > > TaskExecutorGateway.requestThreadInfoSamples > > > > > need > > > > > > to > > > > > > > specify the request id? Is it because the caller can send a > > second > > > > > > request > > > > > > > with the same id? Or can the caller query the result of a > > previous > > > > > > request > > > > > > > by specifying the requestId? > > > > > > > > > > > > > > If the TaskExecutor does not need to know about the id, then we > > > might > > > > > be > > > > > > > able to drop it. > > > > > > > > > > > > > > Cheers > > > > > > > Till > > > > > > > > > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov < > > > > > > alexan...@ververica.com> > > > > > > > wrote: > > > > > > > > > > > > > > > Hi Till, > > > > > > > > > > > > > > > > Thanks for your comments. > > > > > > > > > > > > > > > > * What is the requestId used for in the RPC call? > > > > > > > > > > > > > > > > It is the handle that is used as the key in the > > > > > > > > ThreadInfoRequestCoordinator's pending responses Map. I > believe > > > it > > > > > was > > > > > > > > called sampleId in the StackTraceSampleCoordinator, but I > > decided > > > > to > > > > > > > rename > > > > > > > > it because there is also a ThreadInfoSampleService which is > > > > actually > > > > > > > > responsible for sampling the JVM numSamples number of times. > I > > > > found > > > > > > that > > > > > > > > the notion of what a sample is was a bit confusing. Now one > > > thread > > > > > info > > > > > > > > request corresponds to gathering numSamples from a > > corresponding > > > > > Task. > > > > > > > Hope > > > > > > > > that makes sense. > > > > > > > > > > > > > > > > * Would it make sense to group numSubSamples, > > delayBetweenSamples > > > > and > > > > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This > > would > > > > > > decrease > > > > > > > > the number of parameters and group those which are closely > > > related > > > > > > > > together. > > > > > > > > > > > > > > > > Good point. I will rework it accordingly. > > > > > > > > > > > > > > > > Best, > > > > > > > > -- > > > > > > > > > > > > > > > > Alexander Fedulov | Solutions Architect > > > > > > > > Follow us @VervericaData > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Sent from: > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >