I've created FLINK-13550 <https://issues.apache.org/jira/browse/FLINK-13550> to track the issue.
Is there any committer who'd be willing to "shepherd this effort"? :)

Thanks,
D.

On Fri, Aug 2, 2019 at 10:22 AM David Morávek <d...@apache.org> wrote:

> Hi Paul, for now I only plan to add the one based on Java stack traces.
>
> On Fri, Aug 2, 2019 at 9:34 AM Paul Lam <paullin3...@gmail.com> wrote:
>
>> Hi David,
>>
>> Thanks for the new feature! I think the flame graph would be a useful
>> tool to understand the state of job executions, and it looks good too.
>> +1 for this.
>>
>> And a minor question: do we plan to support multiple kinds of flame
>> graphs? It would be great if we had both on-CPU and off-CPU flame graphs.
>>
>> Best,
>> Paul Lam
>>
>> > On Aug 2, 2019, at 04:24, David Morávek <david.mora...@gmail.com> wrote:
>> >
>> > Hi Till, thanks for the feedback! These endpoints are only called when
>> > the vertex is selected in the UI, so there shouldn't be any heavy RPC
>> > load. For back-pressure, we only sample the top 3 calls of the stack
>> > (depth = 3). For the flame graph, we want to sample the whole stack
>> > trace and we need a different sampling rate (longer period, more
>> > samples). Those are the main reasons to split these into two
>> > "trackers", but I may be missing something.
>> >
>> > I've prepared a little demo, so others can have a better idea of what
>> > I have in mind.
>> >
>> > https://youtu.be/GUNDehj9z9o
>> >
>> > Please note that this is a proof of concept and I'm not a frontend
>> > person, so it may look a little clumsy :)
>> >
>> > D.
>> >
>> > On Thu, Aug 1, 2019 at 11:40 AM Till Rohrmann <trohrm...@apache.org> wrote:
>> >
>> >> Hi David,
>> >>
>> >> thanks for starting this discussion. I like the idea of improving
>> >> insights into Flink's execution and I believe that a flame graph
>> >> could be helpful.
>> >>
>> >> I quickly glanced over your changes and I think they go in a good
>> >> direction. One idea could be to share the `StackTraceSample` produced
>> >> by the `StackTraceSampleCoordinator` between the different
>> >> `StackTraceOperatorTracker` instances so that we don't send multiple
>> >> requests for the same operators. That way we would decrease the RPC
>> >> load a bit.
>> >>
>> >> Apart from that, I think the next steps would be to find a committer
>> >> who could shepherd this effort and help you with merging it.
>> >>
>> >> Cheers,
>> >> Till
>> >>
>> >> On Wed, Jul 31, 2019 at 7:05 PM David Morávek <d...@apache.org> wrote:
>> >>
>> >>> Hello,
>> >>>
>> >>> While looking into Flink internals, I've noticed that there is
>> >>> already a mechanism for stack-trace sampling of a particular job
>> >>> vertex.
>> >>>
>> >>> I think it may be really useful to allow the user to easily render a
>> >>> CPU flamegraph <http://www.brendangregg.com/flamegraphs.html> in the
>> >>> new UI for a selected vertex (new tab next to back pressure) of a
>> >>> running job. The back pressure tab already provides a good idea of
>> >>> which vertex causes trouble, but it's hard to say what's actually
>> >>> going on.
>> >>>
>> >>> I've tried to implement a basic REST endpoint
>> >>> <https://github.com/dmvk/flink/commit/716231822d2fe99004895cdd0a365560479445b9>
>> >>> that prepares data for the flame graph rendering and it seems to be
>> >>> providing good insight.
>> >>>
>> >>> It should be straightforward to render data from the endpoint in the
>> >>> new UI using existing <https://github.com/spiermar/d3-flame-graph>
>> >>> JavaScript libraries.
>> >>>
>> >>> WDYT? Is this worth pushing forward?
>> >>>
>> >>> D.
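
For readers following the thread, here is a minimal sketch of how full stack-trace samples could be folded into the nested tree that a flame graph renders. It is not the code from the linked commit and does not use any Flink classes: `FlameGraphSketch`, `Node`, and `aggregate` are hypothetical names, the samples are taken from the current thread with plain `Thread.getStackTrace()`, and the `name`/`value`/`children` JSON shape is assumed to match what d3-flame-graph consumes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlameGraphSketch {

    /** One flame graph node: a frame name, a sample count, and child frames. */
    static final class Node {
        final String name;
        int value;
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String name) {
            this.name = name;
        }

        /** Serializes the subtree as nested {name, value, children} JSON (no escaping). */
        String toJson() {
            StringBuilder sb = new StringBuilder();
            sb.append("{\"name\":\"").append(name)
              .append("\",\"value\":").append(value)
              .append(",\"children\":[");
            boolean first = true;
            for (Node child : children.values()) {
                if (!first) {
                    sb.append(',');
                }
                sb.append(child.toJson());
                first = false;
            }
            return sb.append("]}").toString();
        }
    }

    /** Folds whole stack traces into a tree; each node counts the samples passing through it. */
    static Node aggregate(List<StackTraceElement[]> samples) {
        Node root = new Node("root");
        for (StackTraceElement[] trace : samples) {
            root.value++;
            Node current = root;
            // Index 0 is the innermost frame, so walk from the outermost frame inward.
            for (int i = trace.length - 1; i >= 0; i--) {
                String frame = trace[i].getClassName() + "." + trace[i].getMethodName();
                current = current.children.computeIfAbsent(frame, Node::new);
                current.value++;
            }
        }
        return root;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: sample the current thread a few times at a fixed interval.
        List<StackTraceElement[]> samples = new ArrayList<>();
        Thread target = Thread.currentThread();
        for (int i = 0; i < 5; i++) {
            samples.add(target.getStackTrace());
            Thread.sleep(10);
        }
        System.out.println(aggregate(samples).toJson());
    }
}
```

Because every sample contributes its whole call path, wide nodes in the resulting tree correspond to frames that were on the stack in many samples, which is why the flame graph needs the full trace rather than the depth-3 prefix used for back pressure.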