Hi Paul,

For now, I only plan to add the one based on Java stack traces.

On Fri, Aug 2, 2019 at 9:34 AM Paul Lam <paullin3...@gmail.com> wrote:
> Hi David,
>
> Thanks for the new feature! I think the flame graph would be a useful tool
> to understand the state of job executions, and it looks good too. +1 for
> this.
>
> And a minor question: do we plan to support multiple kinds of flame
> graphs? It would be great if we had both on-CPU and off-CPU flame graphs.
>
> Best,
> Paul Lam
>
> On Aug 2, 2019, at 04:24, David Morávek <david.mora...@gmail.com> wrote:
> >
> > Hi Till, thanks for the feedback! These endpoints are only called when the
> > vertex is selected in the UI, so there shouldn't be any heavy RPC load. For
> > back-pressure, we only sample the top 3 calls of the stack (depth = 3). For
> > the flame graph, we want to sample the whole stack trace, and we need a
> > different sampling rate (longer period, more samples). Those are the main
> > reasons to split these into two "trackers", but I may be missing something.
> >
> > I've prepared a little demo, so others can have a better idea of what I
> > have in mind.
> >
> > https://youtu.be/GUNDehj9z9o
> >
> > Please note that this is a proof of concept and I'm not a frontend person,
> > so it may look a little clumsy :)
> >
> > D.
> >
> > On Thu, Aug 1, 2019 at 11:40 AM Till Rohrmann <trohrm...@apache.org> wrote:
> >
> >> Hi David,
> >>
> >> thanks for starting this discussion. I like the idea of improving insights
> >> into Flink's execution and I believe that a flame graph could be helpful.
> >>
> >> I quickly glanced over your changes and I think they go in a good
> >> direction. One idea could be to share the `StackTraceSample` produced by
> >> the `StackTraceSampleCoordinator` between the different
> >> `StackTraceOperatorTracker` so that we don't send multiple requests for the
> >> same operators. That way we would decrease the RPC load a bit.
> >>
> >> Apart from that, I think the next steps would be to find a committer who
> >> could shepherd this effort and help you with merging it.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Wed, Jul 31, 2019 at 7:05 PM David Morávek <d...@apache.org> wrote:
> >>
> >>> Hello,
> >>>
> >>> While looking into Flink internals, I've noticed that there is already a
> >>> mechanism for stack-trace sampling of a particular job vertex.
> >>>
> >>> I think it may be really useful to allow users to easily render a CPU
> >>> flame graph <http://www.brendangregg.com/flamegraphs.html> in the new UI
> >>> for a selected vertex (a new tab next to back pressure) of a running job.
> >>> The back pressure tab already provides a good idea of which vertex causes
> >>> trouble, but it's hard to say what's actually going on.
> >>>
> >>> I've tried to implement a basic REST endpoint
> >>> <https://github.com/dmvk/flink/commit/716231822d2fe99004895cdd0a365560479445b9>
> >>> that prepares data for flame graph rendering, and it seems to provide
> >>> good insight.
> >>>
> >>> It should be straightforward to render data from the endpoint in the new
> >>> UI using existing JavaScript libraries
> >>> (<https://github.com/spiermar/d3-flame-graph>).
> >>>
> >>> WDYT? Is this worth pushing forward?
> >>>
> >>> D.
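For illustration, here is a minimal sketch of the kind of aggregation such an endpoint might perform. It folds sampled Java stack traces into a call tree whose nodes count how many samples passed through each frame. Several parts of it are assumptions, not the code in the linked commit:

- The samples are assumed to be available as plain `StackTraceElement[]` arrays.
- The JSON output uses the nested `name`/`value`/`children` shape assumed to be what d3-flame-graph consumes.
- `FlameGraphSketch`, `Node` and `aggregate` are hypothetical names.
- The frames in `main` are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the code from the linked commit: folds sampled Java
// stack traces into a call tree that a flame graph library could render.
final class FlameGraphSketch {

    // One flame graph node: a frame name, the number of samples that passed
    // through it (inclusive of its callees), and its callees keyed by frame name.
    static final class Node {
        final String name;
        int value;
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String name) {
            this.name = name;
        }

        Node child(String frameName) {
            return children.computeIfAbsent(frameName, Node::new);
        }

        // Serializes the subtree as nested {"name", "value", "children"} objects
        // (assumed to match the input format expected by d3-flame-graph).
        String toJson() {
            StringBuilder sb = new StringBuilder();
            sb.append("{\"name\":\"").append(escape(name))
              .append("\",\"value\":").append(value)
              .append(",\"children\":[");
            boolean first = true;
            for (Node c : children.values()) {
                if (!first) {
                    sb.append(',');
                }
                sb.append(c.toJson());
                first = false;
            }
            return sb.append("]}").toString();
        }
    }

    // Minimal JSON string escaping: handles backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Java stack traces list the innermost frame at index 0, so each sample is
    // walked from the last element (the thread's entry point) down to index 0.
    static Node aggregate(List<StackTraceElement[]> samples) {
        Node root = new Node("root");
        for (StackTraceElement[] trace : samples) {
            root.value++;
            Node current = root;
            for (int i = trace.length - 1; i >= 0; i--) {
                StackTraceElement frame = trace[i];
                current = current.child(frame.getClassName() + "." + frame.getMethodName());
                current.value++;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        // Made-up frames standing in for a task thread's call stack.
        StackTraceElement run = new StackTraceElement("Task", "run", "Task.java", 1);
        StackTraceElement process = new StackTraceElement("Operator", "processElement", "Operator.java", 2);
        StackTraceElement parse = new StackTraceElement("Parser", "parse", "Parser.java", 3);
        // Two samples: one caught inside Parser.parse, one in Operator.processElement.
        List<StackTraceElement[]> samples = Arrays.asList(
            new StackTraceElement[] {parse, process, run},
            new StackTraceElement[] {process, run});
        System.out.println(aggregate(samples).toJson());
    }
}
```

Because the tree is rooted at the thread's entry point, this only works on full stack traces. Samples truncated to the top 3 frames, as the back-pressure tracker takes them, would lose the root-side frames the tree is built from. That is consistent with the point above that the flame graph needs its own sampling configuration.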