I've created FLINK-13550 <https://issues.apache.org/jira/browse/FLINK-13550>
to track the issue.

Is there any committer who'd be willing to "shepherd this effort"? :)

Thanks,
D.

On Fri, Aug 2, 2019 at 10:22 AM David Morávek <d...@apache.org> wrote:

> Hi Paul, for now I only plan to add the one based on Java stack traces.
>
> On Fri, Aug 2, 2019 at 9:34 AM Paul Lam <paullin3...@gmail.com> wrote:
>
>> Hi David,
>>
>> Thanks for the new feature! I think the flame graph would be a useful
>> tool to understand the state of job executions, and it looks good too. +1
>> for this.
>>
>> And a minor question: do we plan to support multiple kinds of flame
>> graphs? It would be great to have both on-CPU and off-CPU flame graphs.
>>
>> Best,
>> Paul Lam
>>
>> > On Aug 2, 2019, at 04:24, David Morávek <david.mora...@gmail.com> wrote:
>> >
>> > Hi Till, thanks for the feedback! These endpoints are only called when the
>> > vertex is selected in the UI, so there shouldn't be any heavy RPC load. For
>> > back-pressure, we only sample the top 3 calls of the stack (depth = 3). For
>> > the flame graph, we want to sample the whole stack trace and we need a
>> > different sampling rate (longer period, more samples). Those are the main
>> > reasons to split these into two "trackers", but I may be missing something.
>> >
>> > I've prepared a little demo, so others can have a better idea of what I
>> > have in mind.
>> >
>> > https://youtu.be/GUNDehj9z9o
>> >
>> > Please note that this is a proof of concept and I'm not a frontend person,
>> > so it may look a little clumsy :)
>> >
>> > D.
>> >
>> > On Thu, Aug 1, 2019 at 11:40 AM Till Rohrmann <trohrm...@apache.org> wrote:
>> >
>> >> Hi David,
>> >>
>> >> thanks for starting this discussion. I like the idea of improving insights
>> >> into Flink's execution and I believe that a flame graph could be helpful.
>> >>
>> >> I quickly glanced over your changes and I think they go in a good
>> >> direction. One idea could be to share the `StackTraceSample` produced by
>> >> the `StackTraceSampleCoordinator` between the different
>> >> `StackTraceOperatorTracker`s so that we don't send multiple requests for
>> >> the same operators. That way we would slightly decrease the RPC load.
>> >>
>> >> Apart from that, I think the next steps would be to find a committer who
>> >> could shepherd this effort and help you with merging it.
>> >>
>> >> Cheers,
>> >> Till
>> >>
>> >> On Wed, Jul 31, 2019 at 7:05 PM David Morávek <d...@apache.org> wrote:
>> >>
>> >>> Hello,
>> >>>
>> >>> While looking into Flink internals, I've noticed that there is already a
>> >>> mechanism for stack-trace sampling of a particular job vertex.
>> >>>
>> >>> I think it may be really useful to allow users to easily render a CPU
>> >>> flamegraph <http://www.brendangregg.com/flamegraphs.html> in the new UI
>> >>> for a selected vertex (new tab next to back pressure) of a running job.
>> >>> The back pressure tab already provides a good idea of which vertex causes
>> >>> trouble, but it's hard to say what's actually going on.
>> >>>
>> >>> I've tried to implement a basic REST endpoint
>> >>> <https://github.com/dmvk/flink/commit/716231822d2fe99004895cdd0a365560479445b9>
>> >>> that prepares data for the flame graph rendering, and it seems to be
>> >>> providing good insight.
>> >>>
>> >>> It should be straightforward to render data from the endpoint in the new
>> >>> UI using existing <https://github.com/spiermar/d3-flame-graph> JavaScript
>> >>> libraries.
>> >>>
>> >>> WDYT? Is this worth pushing forward?
>> >>>
>> >>> D.
>> >>>
>> >>
>>
>>
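To make the idea in the thread concrete, here is a minimal sketch of how sampled Java stack traces could be folded into a tree that a flame graph library can render. It is not the code from the linked commit. The class name `FlameGraphSketch`, the `Node` type, and the `build` method are hypothetical, and the `name`/`value`/`children` JSON shape is an assumption based on what d3-flame-graph appears to expect.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: folds stack-trace samples into a flame graph tree. */
public class FlameGraphSketch {

    /** One frame in the flame graph: a name, a sample count, and child frames. */
    static final class Node {
        final String name;
        int value;
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String name) {
            this.name = name;
        }

        /** Serializes the subtree; assumes names need no JSON escaping. */
        String toJson() {
            StringBuilder sb = new StringBuilder();
            sb.append("{\"name\":\"").append(name)
              .append("\",\"value\":").append(value)
              .append(",\"children\":[");
            boolean first = true;
            for (Node child : children.values()) {
                if (!first) {
                    sb.append(',');
                }
                sb.append(child.toJson());
                first = false;
            }
            return sb.append("]}").toString();
        }
    }

    /** Each sample is one full stack trace, innermost frame first. */
    static Node build(List<StackTraceElement[]> samples) {
        Node root = new Node("root");
        for (StackTraceElement[] trace : samples) {
            root.value++;
            Node current = root;
            // Walk from the outermost frame down to the innermost one,
            // so that common callers are merged near the root.
            for (int i = trace.length - 1; i >= 0; i--) {
                String frame = trace[i].getClassName() + "." + trace[i].getMethodName();
                current = current.children.computeIfAbsent(frame, Node::new);
                current.value++;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        // Made-up frames for illustration only.
        StackTraceElement run = new StackTraceElement("Task", "run", "Task.java", 1);
        StackTraceElement map = new StackTraceElement("Op", "map", "Op.java", 1);
        StackTraceElement read = new StackTraceElement("Src", "read", "Src.java", 1);
        Node root = build(Arrays.asList(
                new StackTraceElement[] {map, run},
                new StackTraceElement[] {map, run},
                new StackTraceElement[] {read, run}));
        System.out.println(root.toJson());
    }
}
```

The width of each frame in the rendered graph is proportional to its `value`, that is, the number of samples in which that frame appeared on that call path. This is also why the whole stack trace is needed here, while a depth of 3 is enough for back-pressure sampling.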
