Hi Chesnay, > The new option must default to the configured value of metrics.scope.jm.job because otherwise it isn't backwards-compatible with existing coordinator metrics. IIUC, there is only an OperatorCoordinatorMetricGroup interface without implementations (it never worked). There were no coordinator metrics implemented and registered. @Hang Ruan <ruanhang1...@gmail.com> please correct me if I'm wrong.
> Intuitively I think the config key should be metrics.scope.jm.operator., similar to metrics.scope.jm.job. However, the config key "metrics.scope.jm.job" is deprecated and is replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is replaced with "metrics.scope.tm-job". That's why I though "metrics.scope.jm-operator" might be more consistent. > We would only support operator_name, operator_id, task_name and task_id (+ everything else from the jm/job groups of course). +1 Best, Jark On Mon, 30 Jan 2023 at 20:09, Chesnay Schepler <ches...@apache.org> wrote: > The new option must default to the configured value of > metrics.scope.jm.job because otherwise it isn't backwards-compatible with > existing coordinator metrics. > According to some earlier response in this thread there are already > coordinator metrics. > > If that weren't the case you'd be correct, and we'd keep the default > consistent with the default for tm operator metrics. > > Intuitively I think the config key should be metrics.scope.jm.operator., > similar to metrics.scope.jm.job. > For symmetry we should also add new metrics.scope.tm.operator/task and > eventually phase out metrics.scope.operator. > > > Regarding "coordinator metrics don't *belong to* any tasks/vertexes", I > mean the coordinator doesn't run on specific tasks, so it doesn't have > runtime information of tasks. > > That is true. I intentionally only referred to task id/name, as things > like subtask index or attempt ids don't make sense for coordinators as you > said. > We would only support operator_name, operator_id, task_name and task_id (+ > everything else from the jm/job groups of course). > (Maybe in the future attempt_num could be interesting though) > > On 30/01/2023 10:39, Jark Wu wrote: > > Hi Chesnay, > > IIUC, our only disagreement is whether to support task info in the metric > format. But we all agree with the default metric format > "metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>" > which doesn't include task info, right? > > Regarding "coordinator metrics don't *belong to* any tasks/vertexes", I > mean the coordinator doesn't run on specific tasks, so it doesn't have > runtime information of tasks. My concern about adding task info to the > coordinator metric is that > 1) if we register coordinator metrics for every subtask, the number of > coordinator metrics will be exploded when task parallelism is large. This > is an overhead to JM & metric systems, and the duplicated metrics are > useless for users. > 2) Correct me if I'm wrong, MetricGroup#counter only supports registering > one metric. It seems it can't register a metric reporting for multiple > metric names (e.g. multiple subtasks). > > Best, > Jark > > > On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler <ches...@apache.org> wrote: > >> I hadn't looked in detail in to how the task info can be introduced, but >> given that the coordinates are created by the execution graph, where we >> only work on tasks, it should be possible and rather simple. >> >> The OperatorCoordinatorHolder gets access to the ExecutionJobVertex from >> which it can extract everything and create the task/operator groups. >> Exposure to the coordinator could happen via the coordinator context as >> originally planned. >> >> > As we know, coordinator metrics are just like JMJobMetricGroup, which >> don't belong to any tasks/vertexes. >> >> I don't think this is true, every coordinator is scoped to a specific >> operator, and every operator is associated with a particular vertex. >> >> From the javadocs: >> >> " A coordinator for runtime operators. The OperatorCoordinator runs on >> the master, associated with >> the job vertex of the operator. It communicates with operators via >> sending operator events." >> >> On 28/01/2023 09:15, Jark Wu wrote: >> > Hi all, >> > >> > IIUC, Chesnay means we should have a more general metric group for >> > **operator** >> > not for **coordinator** in JM, this would be useful to extend >> > other operator-specific >> > metrics in the future. That means the new scope format should be >> designed >> > for >> > the operator, >> > e.g., >> metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name> >> > The coordinator metric is a subgroup (a constant "coordinator" suffix) >> of >> > the JMOperatorMG. >> > >> > I think this is a nice design. However, I have a question about adding >> task >> > id/name to this list. >> > How to get the task id/name when reporting a coordinator metric? As we >> > know, coordinator metrics >> > are just like JMJobMetricGroup, which don't belong to any >> tasks/vertexes. >> > Do you mean reporting >> > coordinator metrics for every task id under the operator? >> > >> > Best, >> > Jark >> > >> > >> > On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ches...@apache.org> >> wrote: >> > >> >> > First, I do not understand why users have to configure the new >> scope >> >> format, which has a default value. >> >> >> >> If you don't use scope formats, sure. If you do use scope formats, e.g. >> >> to add a common prefix (which is the case for datadog users for >> >> example), then the current default in the FLIP is insufficient and >> >> requires the user to update the configuration. >> >> >> >> > I usually do not need to change these configuration of scope >> formats >> >> when submitting the flink job. >> >> >> >> Scope formats as a whole are quite a power-user feature, but that >> >> doesn't mean we should ignore it. >> >> >> >> > I try to let it extend the ComponentMetricGroup >> >> >> >> This isn't inherently required just because it is a "component". >> >> Component metric groups should _only_ be used for cases where such a >> >> component provides several bits of metadata. >> >> For example, tasks provide vertex ids, task names, attempted IDs etc., >> >> and we'd like users to have the option on how this metadata ends up >> >> being used (via scope formats). This currently can't be built with the >> >> addGroup() methods, hence why the component groups exist. >> >> However, the operator coordinator _itself_does not provide several bits >> >> of metadata. Logically, the _operator_ does, not the coordinator. >> >> >> >> > This make me decide to add a new scope format. >> >> >> >> And this is fine; just don't add a new scope format for the coordinator >> >> but logically for operators on the JM side, such that we can extend the >> >> set of operator-specific metrics in the future without having to add >> yet >> >> another scope format. >> >> >> >> > The relationship between the Operator and OperatorCoordinator is >> >> maintained in the OperatorCoordinatorHolder. In the POC, the operator >> >> name/id could be found and the OperatorCoordinatorMetricGroup will be >> >> created here. >> >> >> >> That's all irrelevant. >> >> >> >> > The registered metrics of OperatorCoordinator always are those >> which >> >> can not be aggregated from the tasks, like the number of unassigned >> >> splits in the SourceCoordinator. Actually we have not encountered the >> >> scenario you mentioned. I think the OperatorCoordinatorMetricGroup >> >> should only contain the metrics for itself instead of its subtasks. >> >> >> >> You are misunderstanding the problem. >> >> >> >> This isn't about aggregating metrics on our side or exposing metrics >> >> from TMs on the JM side or anything like that. >> >> >> >> It's purely about the available metadata for operator metrics on the JM >> >> side. Currently you suggest operator name / id; what I'm proposing is >> to >> >> add task(==vertex!) id / name to this list. >> >> >> >> The benefit here is simple. I can look up all metrics where >> task_id==XXX >> >> and get _everything_ related to that vertex, including all metrics >> >> associated with operators that are part of that vertex, including the >> >> coordinator. >> >> >> >> > Using metrics.scope.jm.job is not enough to distinguish the >> different >> >> coordinators. >> >> >> >> I did not suggest just using metrics.scope.jm.job. I suggested that as >> >> the default for the jm operator scope format, that is used for the >> >> proposed JobManagerOperatorMetricGroup, which will have some plain >> >> metric groups as children for coordinators. >> >> Aka, you have the JMOperatorMG, then you call addGroup("coordinators") >> >> and then addGroup(coordinator_name) for each coordinator. >> >> >> >> > So there are two choice for the >> >> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new >> scope >> >> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one >> >> is better? >> >> >> >> There are more options than that. >> >> I feel like there are a lot of misunderstandings in this discussion. >> >> Please let me know if I could clear things up. If not I can also >> provide >> >> a PoC based on your PoC if that can speed things up. It's not that >> >> different anyway. >> >> >> >> On 19/01/2023 07:39, Hang Ruan wrote: >> >>> Hi, chesnay, >> >>> >> >>> Thanks for your reply. I still have some doubts about the questions >> >>> you raised. >> >>> >> >>> > Extending the set of ScopeFormats is problematic because it in >> >>> practice >> >>> it breaks the config if users actively rely on it, since there's >> now >> >>> another key that they _must_ set for it to be >> >>> consistent/compatible with >> >>> their existing setup. >> >>> Unfortunately due to how powerful scope formats are we can't >> derive a >> >>> default value that matches their existing setup. >> >>> Hence we should try to do this as rarely as possible. >> >>> >> >>> >> >>> > This FLIP does not adhere to that since it proposes a dedicated >> >>> format >> >>> for coordinators; next time we want to expose operator-specific >> >>> metrics >> >>> (e.g., in the scheduler) we'd have to add another one to support >> it. >> >>> >> >>> >> >>> First, I do not understand why users have to configure the new scope >> >>> format, which has a default value. I usually do not need to change >> >>> these configuration of scope formats when submitting the flink job. >> >>> IMO, the OperatorCoordinator is a component running at the JobMaster. >> >>> I try to let it extend the ComponentMetricGroup, but I cannot find >> >>> a suitable scope. This make me decide to add a new scope format. >> >>> >> >>> > Additionally, the configurable variables (operator name/id) are >> >>> logically not attached to the coordinator, but operators, so to >> me it >> >>> just doesn't make sense to structure it like this. >> >>> >> >>> >> >>> The relationship between the Operator and OperatorCoordinator is >> >>> maintained in the OperatorCoordinatorHolder. In the POC, the operator >> >>> name/id could be found and the OperatorCoordinatorMetricGroup will be >> >>> created here. >> >>> >> >>> > Another thing I'm concerned about is that, because we don't >> include >> >>> tasks in the hierarchy, users wishing to collect all metrics for >> a >> >>> particular task (in this case ==vertex) now have to go >> >>> significantly out >> >>> of their way to get them, since they can no longer just filter >> by the >> >>> task ID but have to be filter for _all_ operators that are part >> of >> >>> the task. >> >>> >> >>> >> >>> The registered metrics of OperatorCoordinator always are those which >> >>> can not be aggregated from the tasks, like the number of unassigned >> >>> splits in the SourceCoordinator. Actually we have not encountered the >> >>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup >> >>> should only contain the metrics for itself instead of its subtasks. >> >>> >> >>> Using metrics.scope.jm.job is not enough to distinguish the different >> >>> coordinators. The operator id/name is necessary. Then >> >>> the implementation should be like this. But users can not change the >> >>> scope in this way. This is also acceptable. >> >>> @Internal public class InternalOperatorCoordinatorMetricGroup >> >>> extends ProxyMetricGroup<MetricGroup> >> >>> implements OperatorCoordinatorMetricGroup { >> >>> public InternalOperatorCoordinatorMetricGroup( >> >>> JobManagerJobMetricGroup parent, OperatorID operatorID, >> >> String operatorName) { >> >>> super(parent.addGroup(operatorID + >> >> operatorName).addGroup("coordinator")); } >> >>> } >> >>> So there are two choice for the >> >>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new >> >>> scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. >> >>> Which one is better? >> >>> >> >>> Best, >> >>> Hang >> >>> >> >>> >> >>> Chesnay Schepler <ches...@apache.org> 于2023年1月18日周三 17:03写道: >> >>> >> >>> You're misunderstanding the problem. >> >>> >> >>> Metric groups form a tree with each group providing certain >> metadata. >> >>> >> >>> E.g., on the taskmanager we have a TM metric group that provides >> info >> >>> about the TM, that has child task metric groups, that have child >> >>> operator metric groups etc. The operator metric group again has >> >>> children, where we are now mostly in the user/connector land, >> like a >> >>> kafka metric group providing partition info or something. >> >>> >> >>> What is being proposed is to create a coordinator metric group on >> >>> the JM >> >>> side that provides operator metadata. >> >>> >> >>> A more appropriate structure is to create an operator group on >> the JM >> >>> side that provides this info, with the coordinator metric group >> >>> being a >> >>> child. >> >>> >> >>> On 17/01/2023 19:56, Steven Wu wrote: >> >>> >> Additionally, the configurable variables (operator name/id) >> are >> >>> > logically not attached to the coordinator, but operators, so to >> >>> me it >> >>> > just doesn't make sense to structure it like this. >> >>> > >> >>> > Chesnay, maybe we should clarify the terminology. To me, >> >>> pperators (like >> >>> > FLIP-27 source) can have two parts (coordinator and >> >>> reader/subtask). I >> >>> > think it is fine to include operator name/id for coordinator >> >>> metrics. >> >>> > >> >>> > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler >> >>> <ches...@apache.org> wrote: >> >>> > >> >>> >> Slight correction: Using metrics.scope.jm.job as the default >> >>> should be >> >>> >> safe. >> >>> >> >> >>> >> On 16/01/2023 10:18, Chesnay Schepler wrote: >> >>> >>> The proposed ScopeFormat is still problematic for a few >> reasons. >> >>> >>> >> >>> >>> Extending the set of ScopeFormats is problematic because it >> in >> >>> >>> practice it breaks the config if users actively rely on it, >> since >> >>> >>> there's now another key that they _must_ set for it to be >> >>> >>> consistent/compatible with their existing setup. >> >>> >>> Unfortunately due to how powerful scope formats are we can't >> >>> derive a >> >>> >>> default value that matches their existing setup. >> >>> >>> Hence we should try to do this as rarely as possible. >> >>> >>> >> >>> >>> This FLIP does not adhere to that since it proposes a >> >>> dedicated format >> >>> >>> for coordinators; next time we want to expose >> operator-specific >> >>> >>> metrics (e.g., in the scheduler) we'd have to add another >> one to >> >>> >>> support it. >> >>> >>> >> >>> >>> Additionally, the configurable variables (operator name/id) >> are >> >>> >>> logically not attached to the coordinator, but operators, so >> >>> to me it >> >>> >>> just doesn't make sense to structure it like this. >> >>> >>> >> >>> >>> Another thing I'm concerned about is that, because we don't >> >>> include >> >>> >>> tasks in the hierarchy, users wishing to collect all metrics >> for >> >> a >> >>> >>> particular task (in this case ==vertex) now have to go >> >>> significantly >> >>> >>> out of their way to get them, since they can no longer just >> >>> filter by >> >>> >>> the task ID but have to be filter for _all_ operators that >> are >> >>> part of >> >>> >>> the task. >> >>> >>> >> >>> >>> On 16/01/2023 03:09, Hang Ruan wrote: >> >>> >>>> Hi, @ches...@apache.org <ches...@apache.org> , >> >>> >>>> >> >>> >>>> Do you have time to help to review this FLIP again after the >> >>> >>>> modification? >> >>> >>>> Looking forward to your reply. >> >>> >>>> This FLIP will add a new configuration for the >> >>> >>>> OperatorCoordinatorMetricGroup scope format. It provides an >> >>> internal >> >>> >>>> implementation and is added as a component to the >> >>> >>>> JobManagerJobMetricGroup. >> >>> >>>> If something doesn't make sense, could you provide some >> >>> advice? It >> >>> >>>> will be >> >>> >>>> very helpful. Thanks a lot for your help. >> >>> >>>> >> >>> >>>> Best, >> >>> >>>> Hang >> >>> >>>> >> >>> >>>> Martijn Visser <martijnvis...@apache.org> 于2023年1月11日周三 >> >>> 16:34写道: >> >>> >>>> >> >>> >>>> >> >>> >>>>> Hi Hang, >> >>> >>>>> >> >>> >>>>> I'm a bit surprised that this has gone to a vote, given >> that >> >>> Chesnay >> >>> >>>>> deliberately mentioned that he would vote against it as-is. >> >>> I would >> >>> >>>>> expect >> >>> >>>>> that before going to a vote, he has had the opportunity to >> >>> >>>>> participate in >> >>> >>>>> this discussion. >> >>> >>>>> >> >>> >>>>> Best regards, >> >>> >>>>> >> >>> >>>>> Martijn >> >>> >>>>> >> >>> >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <imj...@gmail.com> >> >>> wrote: >> >>> >>>>> >> >>> >>>>>> Hi Dong, >> >>> >>>>>> >> >>> >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only >> >>> concern is >> >>> >>>>>> that >> >>> >>>>>> this is a core interface for the FLIP. It’s hard to tell >> how >> >>> >>>>>> sources use >> >>> >>>>>> metric group without mentioning this interface. Even if >> >>> this is an >> >>> >>>>> existing >> >>> >>>>>> API, I think it’s worth introducing the interface again >> and >> >>> declaring >> >>> >>>>> that >> >>> >>>>>> we will implement the interface instead of a no-op method >> >>> in this >> >>> >>>>>> FLIP. >> >>> >>>>>> >> >>> >>>>>> Anyway, this is a minor problem and shouldn’t block this >> >>> FLIP. I’m >> >>> >>>>>> +1 to >> >>> >>>>>> start a vote. >> >>> >>>>>> >> >>> >>>>>> Best, >> >>> >>>>>> Jark >> >>> >>>>>> >> >>> >>>>>> >> >>> >>>>>>> 2023年1月3日 10:03,Hang Ruan <ruanhang1...@gmail.com> 写道: >> >>> >>>>>>> >> >>> >>>>>>> Hi, Jark and Dong, >> >>> >>>>>>> >> >>> >>>>>>> Thanks for your comments. Sorry for my late reply. >> >>> >>>>>>> >> >>> >>>>>>> For suggestion 1, I plan to implement the >> >>> >>>>>>> SplitEnumeratorMetricGroup in >> >>> >>>>>>> another issue, and it is not contained in this FLIP. I >> >>> will add some >> >>> >>>>>>> description about this part. >> >>> >>>>>>> For suggestion 2, changes about >> >>> OperatorCoordinator#metricGroup has >> >>> >>>>>> already >> >>> >>>>>>> been documented in the proposed change section. >> >>> >>>>>>> >> >>> >>>>>>> Best, >> >>> >>>>>>> Hang >> >>> >>>>>>> >> >>> >>>>>>> Dong Lin <lindon...@gmail.com> 于2023年1月1日周日 09:45写道: >> >>> >>>>>>> >> >>> >>>>>>>> Let me chime-in and add comments regarding the public >> >>> interface >> >>> >>>>> section. >> >>> >>>>>>>> Please see my comments inline. >> >>> >>>>>>>> >> >>> >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu >> >>> <imj...@gmail.com> wrote: >> >>> >>>>>>>> >> >>> >>>>>>>>> Hi Hang, >> >>> >>>>>>>>> >> >>> >>>>>>>>> Thanks for driving this discussion. I think this is a >> >>> very useful >> >>> >>>>>> feature >> >>> >>>>>>>>> for connectors. >> >>> >>>>>>>>> >> >>> >>>>>>>>> The FLIP looks quite good to me, and I just have two >> >>> suggestions. >> >>> >>>>>>>>> >> >>> >>>>>>>>> 1. In the "Public Interface" section, mention that the >> >>> >>>>>>>>> implementation >> >>> >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is >> >>> changed from >> >>> >>>>>>>> returning >> >>> >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup >> >>> instance. >> >>> >>>>> Even >> >>> >>>>>>>>> though the API is already there, the behavior change >> can >> >>> also be >> >>> >>>>>>>> considered >> >>> >>>>>>>>> a public change. >> >>> >>>>>>>>> >> >>> >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and >> >>> this FLIP >> >>> >>>>>>>> does >> >>> >>>>>> not >> >>> >>>>>>>> seem to change its semantics/behavior. The FLIP does >> >>> change the >> >>> >>>>>>>> implementation/behavior of >> >>> SourceCoordinatorContext#metricGroup, >> >>> >>>>>>>> which >> >>> >>>>>> is >> >>> >>>>>>>> marked @Internal. >> >>> >>>>>>>> >> >>> >>>>>>>> Thus it might seem a bit weird to add in the public >> >> interface >> >>> >>>>>>>> section >> >>> >>>>>>>> saying that we change the interface >> >>> >>>>>>>> SplitEnumeratorContext#metricGroup >> >>> >>>>>> from >> >>> >>>>>>>> returning null to non-null object. >> >>> >>>>>>>> >> >>> >>>>>>>> 2. Mention the newly added interface of >> >>> >>>>>> "OperatorCoordinator#metricGroup" >> >>> >>>>>>>>> in the "Proposed Changes" section or "Public Interface" >> >>> section. As >> >>> >>>>> the >> >>> >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many >> >>> connectors. >> >>> >>>>>> Though >> >>> >>>>>>>> it >> >>> >>>>>>>>> is still an @Internal API, I think it is worth >> >>> mentioning the >> >>> >>>>>>>>> change >> >>> >>>>> in >> >>> >>>>>>>> the >> >>> >>>>>>>>> FLIP. >> >>> >>>>>>>> Since OperatorCoordinator is an internal API, it seems >> >>> reasonable to >> >>> >>>>>>>> explain it in the proposed change section. The FLIP >> seems >> >>> to have >> >>> >>>>>>>> documented this in point 5 of the proposed change >> section. >> >>> >>>>>>>> >> >>> >>>>>>>> BTW, if we think there are @internal classes that are >> >>> important >> >>> >>>>>>>> enough >> >>> >>>>>> to >> >>> >>>>>>>> be added in the public interface section, it might be >> >>> useful to >> >>> >>>>>> explicitly >> >>> >>>>>>>> discuss this topic and document it in the "*What are the >> >>> "public >> >>> >>>>>>>> interfaces" of the project*" in this >> >>> >>>>>>>> < >> >>> >>>>>>>> >> >>> >> >> >>> >> >> >> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals >> >>> >>>>>>>> wiki. >> >>> >>>>>>>> >> >>> >>>>>>>> >> >>> >>>>>>>>> Best, >> >>> >>>>>>>>> Jark >> >>> >>>>>>>>> >> >>> >>>>>>>>> >> >>> >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan >> >>> <ruanhang1...@gmail.com> >> >>> >>>>>> wrote: >> >>> >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng. >> >>> >>>>>>>>>> After combining everyone's comments, the main >> concerns and >> >>> >>>>>>>> corresponding >> >>> >>>>>>>>>> adjustments are as follows. >> >>> >>>>>>>>>> >> >>> >>>>>>>>>> Q1: Common metrics are not quite useful. >> >>> >>>>>>>>>> numEventsIn and numEventsOut counters will be removed >> >>> from the >> >>> >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics >> do >> >> not >> >>> >>>>>>>>>> provide >> >>> >>>>>>>>> enough >> >>> >>>>>>>>>> information for users. The users are more willing to >> >>> get the >> >>> >>>>>>>>>> number >> >>> >>>>> of >> >>> >>>>>>>>>> events of the specified type instead of the total >> >>> number. And this >> >>> >>>>>>>> metric >> >>> >>>>>>>>>> is calculated differently. The implementation could >> >>> register the >> >>> >>>>>> metric >> >>> >>>>>>>>> by >> >>> >>>>>>>>>> themselves. >> >>> >>>>>>>>>> >> >>> >>>>>>>>>> Q2: This FLIP is overly complicated. >> >>> >>>>>>>>>> This FLIP will become concise after these >> modifications. >> >>> >>>>>>>>>> OperatorCoordinatorMetricGroup has already been >> >>> introduced into >> >>> >>>>> Flink >> >>> >>>>>>>> by >> >>> >>>>>>>>>> FLIP-179< >> >>> >>>>>>>>>> >> >>> >>>>>>>>>> >> >>> >> >> >>> >> >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics >> >>> >>>>>>>>>>> . >> >>> >>>>>>>>>> And >> >>> >>>>>>>>>> this FLIP will not change it. This FLIP only provides >> a >> >>> new metric >> >>> >>>>>>>> option >> >>> >>>>>>>>>> and a new metric group scope. The changes in proposed >> >>> changes >> >>> >>>>> provide >> >>> >>>>>>>> the >> >>> >>>>>>>>>> details about the modifications for the internal >> >>> classes, which >> >>> >>>>> might >> >>> >>>>>>>>> make >> >>> >>>>>>>>>> it look complicated. >> >>> >>>>>>>>>> >> >>> >>>>>>>>>> Thanks for all the comments again. If there are no >> further >> >>> >>>>>>>>>> comments, >> >>> >>>>>> we >> >>> >>>>>>>>>> plan to start the voting thread this week. >> >>> >>>>>>>>>> >> >>> >>>>>>>>>> Best, >> >>> >>>>>>>>>> Hang >> >>> >>>>>>>>>> >> >>> >>>>>>>>>> Qingsheng Ren <renqs...@gmail.com> 于2022年12月26日周一 >> >>> 16:48写道: >> >>> >>>>>>>>>> >> >>> >>>>>>>>>> >> >>> >>>>>>>>>>> Thanks for the FLIP, Hang! >> >>> >>>>>>>>>>> >> >>> >>>>>>>>>>> This FLIP overall looks good to me. Actually I share >> >>> the same >> >>> >>>>> concern >> >>> >>>>>>>>>> with >> >>> >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are >> not >> >>> quite >> >>> >>>>>>>>>>> useful >> >>> >>>>>>>> to >> >>> >>>>>>>>>> end >> >>> >>>>>>>>>>> users. OperatorEvent is a quite low-level >> abstraction, >> >>> which >> >>> >>>>> requires >> >>> >>>>>>>>>>> instantialization in order to be practical to users >> and >> >>> >>>>>>>>>>> developers, >> >>> >>>>>>>> so >> >>> >>>>>>>>>>> maybe it's better to exclude them from the FLIP. >> >>> >>>>>>>>>>> >> >>> >>>>>>>>>>> Best, >> >>> >>>>>>>>>>> Qingsheng >> >>> >>>>>>>>>>> Ververica (Alibaba) >> >>> >>>>>>>>>>> >> >>> >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu >> >>> <reed...@gmail.com> >> >>> >>>>> wrote: >> >>> >>>>>>>>>>>> Hi Hang, >> >>> >>>>>>>>>>>> I still see no strong reason why we need >> >>> >>>>>>>>>>>> numEventsIn/numEventsOut >> >>> >>>>>>>>>>> metrics. >> >>> >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same >> >>> concern >> >>> >>>>>>>>>>>> from >> >>> >>>>>>>>>> others. >> >>> >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid >> >>> >>>>>>>>>>>> over-extending >> >>> >>>>>>>>> the >> >>> >>>>>>>>>>>> scope. >> >>> >>>>>>>>>>>> >> >>> >>>>>>>>>>>> Thanks, >> >>> >>>>>>>>>>>> Zhu >> >>> >>>>>>>>>>>> >> >>> >>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月23日周五 >> >>> >>>>>>>>>>>> 15:21写道: >> >>> >>>>>>>>>>>>> Hi everyone, >> >>> >>>>>>>>>>>>> >> >>> >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should >> keep >> >>> the common >> >>> >>>>>>>>>>> metrics, >> >>> >>>>>>>>>>>> which will help to observe incoming and outgoing >> >>> events. What do >> >>> >>>>>>>> you >> >>> >>>>>>>>>>> think, >> >>> >>>>>>>>>>>> @Zhu Zhu ? >> >>> >>>>>>>>>>>>> And @Chesnay, are there any other issues you are >> >>> more concerned >> >>> >>>>>>>>>> about? >> >>> >>>>>>>>>>>> Looking forward to your reply. >> >>> >>>>>>>>>>>>> Thanks for all the comments. If there are no >> further >> >>> >>>>>>>>>>>>> comments, we >> >>> >>>>>>>>>> plan >> >>> >>>>>>>>>>>> to start the voting thread next week. >> >>> >>>>>>>>>>>>> Best, >> >>> >>>>>>>>>>>>> Hang >> >>> >>>>>>>>>>>>> >> >>> >>>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月15日周四 >> >>> >>>>>>>>>>>>> 16:49写道: >> >>> >>>>>>>>>>>>>> Hi, Zhu Zhu, >> >>> >>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>> Thanks for your feedback! >> >>> >>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>> The OperatorCoordinator implementations are >> >>> different. And >> >>> >>>>>>>>>>>>>> their >> >>> >>>>>>>>>>>> metrics are much different too. We try to find the >> >> common >> >>> >>>>>>>>>>>> metrics >> >>> >>>>>>>> and >> >>> >>>>>>>>>> put >> >>> >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most >> >>> developers >> >>> >>>>>>>> think >> >>> >>>>>>>>> we >> >>> >>>>>>>>>>> do >> >>> >>>>>>>>>>>> not need these common metrics, removing the common >> >>> metrics is >> >>> >>>>>>>>>>>> also >> >>> >>>>>>>>>>>> acceptable. >> >>> >>>>>>>>>>>>>> Best, >> >>> >>>>>>>>>>>>>> Hang >> >>> >>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com> 于2022年12月14日周三 >> >>> 22:09写道: >> >>> >>>>>>>>>>>>>>> Hi Hang & MengYue, >> >>> >>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>> Thanks for creating this FLIP! >> >>> >>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects: >> >>> >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register >> metrics. >> >>> >>>>>>>>>>>>>>> Currently >> >>> >>>>>>>>>>>>>>> the coordinators has no way to do this. And >> operator >> >>> >>>>>>>> coordinator >> >>> >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator >> >>> to have >> >>> >>>>>>>>>>>>>>> access >> >>> >>>>>>>>>>>>>>> to a registered metric group (via the existing >> public >> >>> >>>>>>>>>>>>>>> interface >> >>> >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is >> >>> null at the >> >>> >>>>>>>>> moment. >> >>> >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator >> >>> metrics. A clear >> >>> >>>>>>>>>>>> definition >> >>> >>>>>>>>>>>>>>> makes it easy for users to find their wanted >> >>> metrics. The >> >>> >>>>>>>>>> definition >> >>> >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from >> >> multiple >> >>> >>>>>>>>>>>> OperatorCoordinators >> >>> >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may >> >>> have its >> >>> >>>>>>>>>>>>>>> own >> >>> >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not >> >>> be directly >> >>> >>>>>>>>>>> registered >> >>> >>>>>>>>>>>>>>> to the job metric group. >> >>> >>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the >> >>> introduced >> >>> >>>>>>>>> common >> >>> >>>>>>>>>>>> metrics >> >>> >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If >> there >> >>> any case >> >>> >>>>>>>> which >> >>> >>>>>>>>>>>> strongly >> >>> >>>>>>>>>>>>>>> requires them? >> >>> >>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>> Regarding the concerns of Chesnay, >> >>> >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is >> >> overkill >> >>> >>>>>>>>>>>>>>> Directly using the job metric group can result in >> >>> metric >> >>> >>>>>>>>> conflicts, >> >>> >>>>>>>>>>> as >> >>> >>>>>>>>>>>> mentioned >> >>> >>>>>>>>>>>>>>> in above #2. >> >>> >>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>> Thanks, >> >>> >>>>>>>>>>>>>>> Zhu >> >>> >>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>> Dong Lin <lindon...@gmail.com> 于2022年12月10日周六 >> >>> >>>>>>>>>>>>>>> 14:16写道: >> >>> >>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>> Hi Chesney, >> >>> >>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>> Just to double check with you, >> >>> >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup >> >>> >>>>>>>>>>>> (annotated as >> >>> >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced >> into >> >>> Flink by >> >>> >>>>>>>>>> FLIP-179 >> >>> >>>>>>>>>>>>>>>> < >> >>> >> >> >>> >> >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics >> >>> >>>>>>>>>>>>> . >> >>> >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we >> >>> should remove >> >>> >>>>>>>>> this >> >>> >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup? >> >>> >>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>> Regards, >> >>> >>>>>>>>>>>>>>>> Dong >> >>> >>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay >> Schepler < >> >>> >>>>>>>>>>> ches...@apache.org> >> >>> >>>>>>>>>>>> wrote: >> >>> >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly >> >>> complicated. A >> >>> >>>>>>>>>>> dedicated >> >>> >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it >> >>> could just >> >>> >>>>>>>>> re-use >> >>> >>>>>>>>>>> the >> >>> >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same >> >>> structure we >> >>> >>>>>>>>> have >> >>> >>>>>>>>>> on >> >>> >>>>>>>>>>>> TMs, >> >>> >>>>>>>>>>>>>>>>> similar to what we did with the Job MG. >> >>> >>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>> However, I'm not convinced that this is >> required >> >>> anyway, >> >>> >>>>>>>>>> because >> >>> >>>>>>>>>>>> all the >> >>> >>>>>>>>>>>>>>>>> example metrics you listed can be implemented >> on >> >>> the TM >> >>> >>>>>>>> side >> >>> >>>>>>>>> + >> >>> >>>>>>>>>>>>>>>>> aggregating them in the external metrics >> backend. >> >>> >>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one >> tries >> >>> to pull a >> >>> >>>>>>>>> fast >> >>> >>>>>>>>>>>> one on >> >>> >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be >> >>> against it. >> >>> >>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote: >> >>> >>>>>>>>>>>>>>>>>> Hi Hang, >> >>> >>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and >> it >> >>> is pretty >> >>> >>>>>>>>>>>> informative. >> >>> >>>>>>>>>>>>>>>>>> I have just two minor comments regarding >> names: >> >>> >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key >> as >> >>> >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* >> for >> >>> >>>>>>>> consistency >> >>> >>>>>>>>>>> with >> >>> >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job >> >>> >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)? >> >>> >>>>>>>>>>>>>>>>>> - Maybe rename the variable as >> >>> >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR >> >>> >>>>>>>>>>>> for >> >>> >>>>>>>>>>>>>>>>>> simplicity and consistency with >> >>> SCOPE_NAMING_OPERATOR >> >>> >>>>>>>>> (which >> >>> >>>>>>>>>> is >> >>> >>>>>>>>>>>> not named >> >>> >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)? >> >>> >>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>> Cheers, >> >>> >>>>>>>>>>>>>>>>>> Dong >> >>> >>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan < >> >>> >>>>>>>>>>> ruanhang1...@gmail.com> >> >>> >>>>>>>>>>>> wrote: >> >>> >>>>>>>>>>>>>>>>>>> Hi all, >> >>> >>>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce >> >>> metric group >> >>> >>>>>>>>> for >> >>> >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is >> the >> >>> >>>>>>>>> coordinator >> >>> >>>>>>>>>>> for >> >>> >>>>>>>>>>>> runtime >> >>> >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The >> >>> coordination >> >>> >>>>>>>>>>>> mechanism is >> >>> >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator >> >>> and its all >> >>> >>>>>>>>>>>> operators, the >> >>> >>>>>>>>>>>>>>>>>>> coordination is more and more using in >> Flink, for >> >>> >>>>>>>> example >> >>> >>>>>>>>>> many >> >>> >>>>>>>>>>>> Sources >> >>> >>>>>>>>>>>>>>>>> and >> >>> >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign >> splits >> >> and >> >>> >>>>>>>>>> coordinate >> >>> >>>>>>>>>>>> commits to >> >>> >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is >> >>> widely >> >>> >>>>>>>> using >> >>> >>>>>>>>> in >> >>> >>>>>>>>>>>> flink kafka >> >>> >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc >> >>> connector, >> >>> >>>>>>>>>> flink >> >>> >>>>>>>>>>>> hudi >> >>> >>>>>>>>>>>>>>>>>>> connector and so on. >> >>> >>>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group >> scope >> >>> for the >> >>> >>>>>>>>>>>>>>>>> OperatorCoordinator >> >>> >>>>>>>>>>>>>>>>>>> and not an implementation for the interface >> >>> >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup. >> >>> >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be >> >>> how many >> >>> >>>>>>>>>>>> splits/partitions >> >>> >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how >> many >> >>> files >> >>> >>>>>>>> have >> >>> >>>>>>>>>> been >> >>> >>>>>>>>>>>> written >> >>> >>>>>>>>>>>>>>>>> out >> >>> >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help >> >>> users to >> >>> >>>>>>>> know >> >>> >>>>>>>>>> the >> >>> >>>>>>>>>>>> job >> >>> >>>>>>>>>>>>>>>>> progress >> >>> >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. >> Thus we >> >>> >>>>>>>> propose >> >>> >>>>>>>>>> the >> >>> >>>>>>>>>>>> FLIP-274 >> >>> >>>>>>>>>>>>>>>>> to >> >>> >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for >> >>> >>>>>>>> OperatorCoordinator >> >>> >>>>>>>>>> and >> >>> >>>>>>>>>>>> provide >> >>> >>>>>>>>>>>>>>>>> an >> >>> >>>>>>>>>>>>>>>>>>> internal implementation for >> >>> >>>>>>>>> OperatorCoordinatorMetricGroup. >> >>> >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get >> >>> time? Any >> >>> >>>>>>>>>>> feedback >> >>> >>>>>>>>>>>> is >> >>> >>>>>>>>>>>>>>>>>>> appreciated! >> >>> >>>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>>> Best, >> >>> >>>>>>>>>>>>>>>>>>> Hang >> >>> >>>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>>> [1] >> >>> >>>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>>> >> >>> >> >> >>> >> >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator >> >>> >> >> >>> >> >> >