Hi all there have been lots of discussions since the vote started and many suggestions have been made
Matthias and I had updated the FLIP-102 <https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager> following the suggestions and discussions I want to cancel the vote here and start a new one, thanks Matthias Pohl <matth...@ververica.com> 于2020年8月21日周五 上午3:33写道: > Good points, Andrey. Thanks for clarification. I made some minor > adaptations to the FLIP now: > - Renamed the `resource` member into `configuration` and made it a > top-level member besides `metrics` and `hardware` since it's not fitting > the volatile metrics context that well. > - I restructured the table under Proposed Changes to cover Metaspace now. > Additionally, I renamed `shuffle` into `network` to match the memory model > of FLIP-49. > - The UI in the screenshot picture has a bug: The counts of Direct and > Mapped are accompanied by a memory unit even though they are plain counts. > > On Thu, Aug 20, 2020 at 4:10 PM Andrey Zagrebin <azagre...@apache.org> > wrote: > > > Hi All, > > > > Thanks for reviving the discussion, Matthias! > > > > This would mean that we could adapt the current proposal to replace the > > > Nonheap usage pane by a pane displaying the Metaspace usage. > > > > > I do not know the value of having the Nonheap usage in metrics. I can see > > that the metaspace metric can be interesting for the users to debug OOMs. > > We had the Nonheap usage before, so as discussed, I would be a bit > careful > > removing. I believe it deserves a separate poll in user ML > > whether the Nonheap usage is useless or not. > > As a current solution, we could keep both or merge them into one box > with a > > slash, like Metaspace/Nonheap -> 5Mb/10Mb, if the majority agrees that > this > > is not confusing and clear that the metaspace is a part of Nonheap. > > > > That would be a good solution representing both metrics. I adapted the > table in FLIP-102's Confluence accordingly for now to have it visualized. > Let's see what others are thinking about it. > > > > > > Btw, the "Nonheap" in the configuration box of the current FLIP-102 is > > probably incorrect or confusing as it does not one-to-one correspond to > the > > Nonheap JVM metric. > > > > The only issue I see is that JVM Overhead would still not be represented > in > > > the memory usage > > > overview. > > > > My understanding is that we do not need a usage metric for JVM Overhead > as > > it is a virtual unmanaged component which is more about configuring the > max > > total process memory. > > > > Is there a reason for us to introduce a nested structure > > > TaskManagerMetricsInfo in the response object? I would rather keep it > > > consistent in a flat structure instead, i.e. having all the members of > > > TaskManagerResourceInfo being members of TaskManagerMetricsInfo > > > > I would suggest introducing a separate REST call for > > TaskManagerResourceInfo. > > Semantically, TaskManagerResourceInfo is more about the TM configuration > > and it is not directly related to the usage metrics. > > In future, I would avoid having calls with many responsibilities and > maybe > > consider splitting the 'TM details' call into metrics etc unless there > is a > > concern for having to do more calls instead of one from UI. > > > > Good point. The growing size of the JSON response record might make it > worth splitting it up into different endpoints serving different groups of > data (e.g. /metrics for volatile values and /configuration for static > ones). > > > > > > Alternatively, one could think of grouping the metrics collecting the > > > different values (i.e. max, used, committed) per metric in a JSON > object. > > > But this would apply for all the other metrics of > TaskManagerMetricsInfo > > > as > > > well. > > > > I would personally prefer this for metrics but I am not pushing for this. > > > > metrics.resource.managedMemory and metrics.resource.networkMemory have > > > counterparts in metrics.networkMemory[Used|Total] and > > > metrics.managedMemory[Used|Total]: Is this redundant data or do they > have > > > different semantics? > > > > As I understand, they have different semantics. The later is about > > configuration, the former is about current usage metrics. > > > > I see. Makes sense. > > > > > Is metrics.resource.totalProcessMemory a basic sum over all provided > > > values? > > > > this is again about configuration, I do not think it makes sense to come > up > > with a usage metric for the totalProcessMemory component. > > > > Got it. > > > > Best, > > Andrey > > > > > > On Thu, Aug 20, 2020 at 9:06 AM Matthias <matth...@ververica.com> wrote: > > > > > Hi Jing, > > > I recently joined Ververica and started looking into FLIP-102. I'm > trying > > > to > > > figure out how we would implement the proposal on the backend side. > > > I looked into the proposal for the REST API response and a few > questions > > > popped up: > > > - Is there a reason for us to introduce a nested structure > > > TaskManagerMetricsInfo in the response object? I would rather keep it > > > consistent in a flat structure instead, i.e. having all the members of > > > TaskManagerResourceInfo being members of TaskManagerMetricsInfo. > > > Alternatively, one could think of grouping the metrics collecting the > > > different values (i.e. max, used, committed) per metric in a JSON > object. > > > But this would apply for all the other metrics of > TaskManagerMetricsInfo > > as > > > well. > > > - metrics.resource.managedMemory and metrics.resource.networkMemory > have > > > counterparts in metrics.networkMemory[Used|Total] and > > > metrics.managedMemory[Used|Total]: Is this redundant data or do they > have > > > different semantics? > > > - Is metrics.resource.totalProcessMemory a basic sum over all provided > > > values? I see the necessity to have this member if we decide to not > > provide > > > the memory usage for all memory pools (e.g. providing Metaspace but > > leaving > > > Code Cache and Compressed Class Space as Non-Heap pools out of the > > > response). Otherwise, would it be worth it to remove this member from > the > > > response for simplicity reasons since we could sum up the memory on the > > > frontend side? > > > > > > Best, > > > Matthias > > > > > > > > > > > > -- > > > Sent from: > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ > > > > > > > > -- > > Matthias Pohl | Engineer > > Follow us @VervericaData Ververica <https://www.ververica.com/> > > -- > > Join Flink Forward <https://flink-forward.org/> - The Apache Flink > Conference > > Stream Processing | Event Driven | Real Time > > -- > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany > > -- > Ververica GmbH > Registered at Amtsgericht Charlottenburg: HRB 158244 B > Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton > Wehner >