Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

Paul Lam Fri, 02 Aug 2019 00:35:26 -0700

Hi David,

Thanks for the new feature! I think the flame graph would be a useful tool to 
understand the state of job executions, and it looks good too. +1 for this.


And a minor question: do we plan to support multiple kinds of flame graphs? It
would be great if we had both on-CPU and off-CPU flame graphs.
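
To make the on-CPU/off-CPU distinction concrete, here is a rough sketch of how
sampled threads could be split by JVM thread state. It is only an illustration
under stated assumptions: `ThreadStateSplit` is a hypothetical helper (not part
of Flink), and treating RUNNABLE as "on-CPU" is an approximation, since the JVM
also reports RUNNABLE for threads blocked in native I/O.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Flink.
class ThreadStateSplit {

    // Assumption: these states are counted as "off-CPU" time.
    static final Set<Thread.State> OFF_CPU =
            EnumSet.of(Thread.State.BLOCKED, Thread.State.WAITING, Thread.State.TIMED_WAITING);

    // Approximation: RUNNABLE may still mean "blocked in native I/O".
    static boolean isOnCpu(Thread.State state) {
        return state == Thread.State.RUNNABLE;
    }

    static boolean isOffCpu(Thread.State state) {
        return OFF_CPU.contains(state);
    }

    public static void main(String[] args) {
        // Take a single sample of all live threads in this JVM.
        ThreadInfo[] sample = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        int onCpu = 0;
        int offCpu = 0;
        for (ThreadInfo info : sample) {
            if (isOnCpu(info.getThreadState())) {
                onCpu++;
            } else if (isOffCpu(info.getThreadState())) {
                offCpu++;
            }
        }
        System.out.println("on-CPU threads: " + onCpu + ", off-CPU threads: " + offCpu);
    }
}
```

The stack traces of each group could then feed two separate flame graphs.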

Best,
Paul Lam

> On Aug 2, 2019, at 04:24, David Morávek <[email protected]> wrote:
> 
> Hi Till, thanks for the feedback! These endpoints are only called when the
> vertex is selected in the UI, so there shouldn't be any heavy RPC load. For
> back-pressure, we only sample the top 3 calls of the stack (depth = 3). For
> the flame graph, we want to sample the whole stack trace and we need a
> different sampling rate (longer period, more samples). Those are the main
> reasons to split these into two "trackers", but I may be missing something.
> 
> I've prepared a little demo, so others can have a better idea of what I
> have in mind.
> 
> https://youtu.be/GUNDehj9z9o
> 
> Please note that this is a proof of concept and I'm not a frontend person,
> so it may look a little clumsy :)
> 
> D.
> 
> On Thu, Aug 1, 2019 at 11:40 AM Till Rohrmann <[email protected]> wrote:
> 
>> Hi David,
>> 
>> thanks for starting this discussion. I like the idea of improving insights
>> into Flink's execution and I believe that a flame graph could be helpful.
>> 
>> I quickly glanced over your changes and I think they go in a good
>> direction. One idea could be to share the `StackTraceSample` produced by
>> the `StackTraceSampleCoordinator` between the different
>> `StackTraceOperatorTracker` instances so that we don't send multiple
>> requests for the same operators. That way we would decrease the RPC load
>> a bit.
>> 
>> Apart from that, I think the next steps would be to find a committer who
>> could shepherd this effort and help you with merging it.
>> 
>> Cheers,
>> Till
>> 
>> On Wed, Jul 31, 2019 at 7:05 PM David Morávek <[email protected]> wrote:
>> 
>>> Hello,
>>> 
>>> While looking into Flink internals, I've noticed that there is already a
>>> mechanism for stack-trace sampling of a particular job vertex.
>>> 
>>> I think it may be really useful to allow users to easily render a CPU
>>> flame graph <http://www.brendangregg.com/flamegraphs.html> in the new UI
>>> for a selected vertex (a new tab next to back pressure) of a running job.
>>> The back pressure tab already gives a good idea of which vertex causes
>>> trouble, but it's hard to say what's actually going on.
>>> 
>>> I've tried to implement a basic REST endpoint
>>> <https://github.com/dmvk/flink/commit/716231822d2fe99004895cdd0a365560479445b9>
>>> that prepares data for the flame graph rendering, and it seems to be
>>> providing good insight.
>>> 
>>> It should be straightforward to render data from the endpoint in the new
>>> UI using existing <https://github.com/spiermar/d3-flame-graph> JavaScript
>>> libraries.
>>> 
>>> WDYT? Is this worth pushing forward?
>>> 
>>> D.
>>> 
>>
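
For readers following the thread, here is a minimal sketch of the kind of data
preparation discussed above: folding full stack-trace samples into a tree of
frames with sample counts. It is not the code from the linked commit.
`FlameGraphSketch`, `Node`, and `fold` are hypothetical names, the frame names
in `main` are made up, and the `name`/`value`/`children` JSON shape with
inclusive counts is assumed to be what a d3-flame-graph style renderer
consumes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch for illustration; not the implementation from the commit.
class FlameGraphSketch {

    /** One flame graph node: a frame name, an inclusive sample count, and child frames. */
    static final class Node {
        final String name;
        int value;
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String name) {
            this.name = name;
        }

        /** Serializes the subtree as JSON. Frame names are not escaped in this sketch. */
        String toJson() {
            String kids = children.values().stream()
                    .map(Node::toJson)
                    .collect(Collectors.joining(","));
            return "{\"name\":\"" + name + "\",\"value\":" + value
                    + ",\"children\":[" + kids + "]}";
        }
    }

    /**
     * Merges samples into one tree. Each sample lists frames from the outermost
     * caller to the innermost callee, so a StackTraceElement[] (innermost first)
     * would have to be reversed before being passed in.
     */
    static Node fold(List<List<String>> samples) {
        Node root = new Node("root");
        for (List<String> frames : samples) {
            root.value++;
            Node current = root;
            for (String frame : frames) {
                current = current.children.computeIfAbsent(frame, Node::new);
                current.value++;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        // Made-up frame names standing in for sampled stack traces.
        List<List<String>> samples = Arrays.asList(
                Arrays.asList("Task.run", "Map.map", "Json.parse"),
                Arrays.asList("Task.run", "Map.map", "Json.parse"),
                Arrays.asList("Task.run", "Sink.invoke"));
        System.out.println(fold(samples).toJson());
    }
}
```

Sampling the whole stack, as opposed to depth = 3, is what gives such a tree
enough levels to be useful, and more samples make the frame widths more stable.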
