Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Yadong Xie Thu, 22 Oct 2020 02:36:51 -0700

Hi all,

There has been a lot of discussion since the vote started, and many
suggestions have been made.

Matthias and I have updated FLIP-102
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager>
following the suggestions and discussions.

I want to cancel this vote and start a new one. Thanks.

Matthias Pohl <[email protected]> wrote on Fri, Aug 21, 2020 at 3:33 AM:

> Good points, Andrey. Thanks for the clarification. I made some minor
> adaptations to the FLIP now:
> - Renamed the `resource` member to `configuration` and made it a
> top-level member besides `metrics` and `hardware`, since it does not fit
> the volatile metrics context that well.
> - I restructured the table under Proposed Changes to cover Metaspace now.
> Additionally, I renamed `shuffle` to `network` to match the memory model
> of FLIP-49.
> - The UI in the screenshot has a bug: the counts of Direct and
> Mapped are accompanied by a memory unit even though they are plain counts.
>
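
To illustrate the point about the Direct and Mapped counts, here is a minimal Java sketch. It is not taken from the FLIP or from Flink's code base. It reads the JVM buffer pools through the standard `BufferPoolMXBean` and attaches a memory unit only to byte values, while the count stays a plain number. The class name `BufferPoolOverview` and its helpers `formatBytes` and `describe` are hypothetical.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only (not Flink code).
class BufferPoolOverview {

    /** Formats a byte value with a unit. Plain counts must not go through this. */
    static String formatBytes(long bytes) {
        if (bytes < 1024) {
            return bytes + " B";
        }
        String[] units = {"KB", "MB", "GB", "TB"};
        double value = bytes;
        int unit = -1;
        while (value >= 1024 && unit < units.length - 1) {
            value /= 1024;
            unit++;
        }
        return String.format(Locale.ROOT, "%.1f %s", value, units[unit]);
    }

    /** Renders one pool: the count is unit-less, only the sizes get a unit. */
    static String describe(String name, long count, long usedBytes, long capacityBytes) {
        return name + ": count=" + count
                + ", used=" + formatBytes(usedBytes)
                + ", capacity=" + formatBytes(capacityBytes);
    }

    public static void main(String[] args) {
        // On HotSpot JVMs the pools are typically named "direct" and "mapped".
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.println(describe(
                    pool.getName(),
                    pool.getCount(),
                    pool.getMemoryUsed(),
                    pool.getTotalCapacity()));
        }
    }
}
```
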
> On Thu, Aug 20, 2020 at 4:10 PM Andrey Zagrebin <[email protected]>
> wrote:
>
> > Hi All,
> >
> > Thanks for reviving the discussion, Matthias!
> >
> > > This would mean that we could adapt the current proposal to replace the
> > > Nonheap usage pane by a pane displaying the Metaspace usage.
> >
> > I do not know the value of having the Nonheap usage in metrics. I can see
> > that the metaspace metric can be interesting for the users to debug OOMs.
> > We had the Nonheap usage before, so as discussed, I would be a bit
> > careful about removing it. I believe it deserves a separate poll on the
> > user ML on whether the Nonheap usage is useless or not.
> > As a current solution, we could keep both or merge them into one box
> > with a slash, like Metaspace/Nonheap -> 5Mb/10Mb, if the majority agrees
> > that this is not confusing and that it is clear that the metaspace is a
> > part of Nonheap.
> >
>
> That would be a good solution representing both metrics. I adapted the
> table in FLIP-102's Confluence accordingly for now to have it visualized.
> Let's see what others are thinking about it.
>
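
The relationship between the two values can be checked on a running JVM. The following Java sketch is an illustration only, not the FLIP's implementation. It assumes a HotSpot JVM (Java 8 or later), where one of the non-heap memory pools is named "Metaspace"; other JVMs may name their pools differently. The class name `MetaspaceVsNonHeap` and its methods are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper, for illustration only (not Flink code).
class MetaspaceVsNonHeap {

    /** Used bytes of the non-heap pool named "Metaspace", or -1 if there is none. */
    static long metaspaceUsed() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.NON_HEAP
                    && "Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage(); // null if the pool is not valid
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    /** Renders both values in one box, in the style suggested in this thread. */
    static String combined(long metaspaceBytes, long nonHeapBytes) {
        long mb = 1024L * 1024L;
        return "Metaspace/Nonheap -> "
                + (metaspaceBytes / mb) + "Mb/" + (nonHeapBytes / mb) + "Mb";
    }

    public static void main(String[] args) {
        // The Nonheap JVM metric covers all non-heap pools, Metaspace included.
        MemoryUsage nonHeap = ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
        long metaspace = metaspaceUsed();
        if (metaspace < 0) {
            System.out.println("No pool named Metaspace on this JVM");
        } else {
            System.out.println(combined(metaspace, nonHeap.getUsed()));
        }
    }
}
```
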
>
> >
> > Btw, the "Nonheap" in the configuration box of the current FLIP-102 is
> > probably incorrect or confusing, as it does not correspond one-to-one to
> > the Nonheap JVM metric.
> >
> > > The only issue I see is that JVM Overhead would still not be
> > > represented in the memory usage overview.
> >
> > My understanding is that we do not need a usage metric for JVM Overhead,
> > as it is a virtual unmanaged component which is more about configuring
> > the max total process memory.
> >
> > > Is there a reason for us to introduce a nested structure
> > > TaskManagerMetricsInfo in the response object? I would rather keep it
> > > consistent in a flat structure instead, i.e. having all the members of
> > > TaskManagerResourceInfo being members of TaskManagerMetricsInfo
> >
> > I would suggest introducing a separate REST call for
> > TaskManagerResourceInfo.
> > Semantically, TaskManagerResourceInfo is more about the TM configuration
> > and it is not directly related to the usage metrics.
> > In the future, I would avoid having calls with many responsibilities and
> > maybe consider splitting the 'TM details' call into metrics etc., unless
> > there is a concern about having to do more calls instead of one from the
> > UI.
> >
>
> Good point. The growing size of the JSON response record might make it
> worth splitting it up into different endpoints serving different groups of
> data (e.g. /metrics for volatile values and /configuration for static
> ones).
>
>
> >
> > > Alternatively, one could think of grouping the metrics collecting the
> > > different values (i.e. max, used, committed) per metric in a JSON
> > > object. But this would apply for all the other metrics of
> > > TaskManagerMetricsInfo as well.
> >
> > I would personally prefer this for metrics but I am not pushing for this.
> >
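
The two ideas discussed so far (separate responses for volatile metrics and static configuration, and one JSON object per metric holding used, committed and max) could look roughly like the Java sketch below. It is only an assumption about a possible shape, not the response format agreed in FLIP-102. The class `TaskManagerResponses`, the nested class `MemoryMetric`, and the member names `heap` and `managedMemory` are hypothetical, and the JSON is assembled by hand to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical response shapes, for illustration only (not Flink code).
class TaskManagerResponses {

    /** Groups the values of one metric in a single JSON object. */
    static final class MemoryMetric {
        final long used;
        final long committed;
        final long max;

        MemoryMetric(long used, long committed, long max) {
            this.used = used;
            this.committed = committed;
            this.max = max;
        }

        String toJson() {
            return "{\"used\":" + used
                    + ",\"committed\":" + committed
                    + ",\"max\":" + max + "}";
        }
    }

    /** Joins already serialized member values into one JSON object. */
    private static String object(Map<String, String> members) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : members.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(e.getKey()).append("\":").append(e.getValue());
            first = false;
        }
        return sb.append('}').toString();
    }

    /** Body of a metrics response: volatile values, grouped per metric. */
    static String metricsJson(Map<String, MemoryMetric> metrics) {
        Map<String, String> members = new LinkedHashMap<>();
        for (Map.Entry<String, MemoryMetric> e : metrics.entrySet()) {
            members.put(e.getKey(), e.getValue().toJson());
        }
        return object(members);
    }

    /** Body of a configuration response: static, configured sizes in bytes. */
    static String configurationJson(Map<String, Long> configured) {
        Map<String, String> members = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : configured.entrySet()) {
            members.put(e.getKey(), String.valueOf(e.getValue()));
        }
        return object(members);
    }

    public static void main(String[] args) {
        Map<String, MemoryMetric> metrics = new LinkedHashMap<>();
        metrics.put("heap", new MemoryMetric(100L, 200L, 400L));
        System.out.println(metricsJson(metrics));

        Map<String, Long> configured = new LinkedHashMap<>();
        configured.put("managedMemory", 512L);
        System.out.println(configurationJson(configured));
    }
}
```

Keeping the configured sizes in their own response means a client can fetch them once, while the metrics response is polled repeatedly.
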
> > > metrics.resource.managedMemory and metrics.resource.networkMemory have
> > > counterparts in metrics.networkMemory[Used|Total] and
> > > metrics.managedMemory[Used|Total]: Is this redundant data or do they
> > > have different semantics?
> >
> > As I understand, they have different semantics. The latter is about
> > configuration, the former is about current usage metrics.
> >
>
> I see. Makes sense.
>
> >
> > > Is metrics.resource.totalProcessMemory a basic sum over all provided
> > > values?
> >
> > This is again about configuration. I do not think it makes sense to come
> > up with a usage metric for the totalProcessMemory component.
> >
>
> Got it.
>
>
> > Best,
> > Andrey
> >
> >
> > On Thu, Aug 20, 2020 at 9:06 AM Matthias <[email protected]> wrote:
> >
> > > Hi Jing,
> > > I recently joined Ververica and started looking into FLIP-102. I'm
> > > trying to figure out how we would implement the proposal on the backend
> > > side.
> > > I looked into the proposal for the REST API response and a few
> > > questions popped up:
> > > - Is there a reason for us to introduce a nested structure
> > > TaskManagerMetricsInfo in the response object? I would rather keep it
> > > consistent in a flat structure instead, i.e. having all the members of
> > > TaskManagerResourceInfo being members of TaskManagerMetricsInfo.
> > >   Alternatively, one could think of grouping the metrics collecting the
> > > different values (i.e. max, used, committed) per metric in a JSON
> > > object. But this would apply for all the other metrics of
> > > TaskManagerMetricsInfo as well.
> > > - metrics.resource.managedMemory and metrics.resource.networkMemory
> > > have counterparts in metrics.networkMemory[Used|Total] and
> > > metrics.managedMemory[Used|Total]: Is this redundant data or do they
> > > have different semantics?
> > > - Is metrics.resource.totalProcessMemory a basic sum over all provided
> > > values? I see the necessity to have this member if we decide to not
> > > provide the memory usage for all memory pools (e.g. providing Metaspace
> > > but leaving Code Cache and Compressed Class Space as Non-Heap pools out
> > > of the response). Otherwise, would it be worth it to remove this member
> > > from the response for simplicity reasons since we could sum up the
> > > memory on the frontend side?
> > >
> > > Best,
> > > Matthias
> > >
> > >
> > >
> > >
> >
>
>
> --
>
> Matthias Pohl | Engineer
