Hi all, IIUC, Chesnay means we should have a more general metric group for **operator** not for **coordinator** in JM, this would be useful to extend other operator-specific metrics in the future. That means the new scope format should be designed for the operator, e.g., metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name> The coordinator metric is a subgroup (a constant "coordinator" suffix) of the JMOperatorMG.
I think this is a nice design. However, I have a question about adding task id/name to this list. How to get the task id/name when reporting a coordinator metric? As we know, coordinator metrics are just like JMJobMetricGroup, which don't belong to any tasks/vertexes. Do you mean reporting coordinator metrics for every task id under the operator? Best, Jark On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ches...@apache.org> wrote: > > First, I do not understand why users have to configure the new scope > format, which has a default value. > > If you don't use scope formats, sure. If you do use scope formats, e.g. > to add a common prefix (which is the case for datadog users for > example), then the current default in the FLIP is insufficient and > requires the user to update the configuration. > > > I usually do not need to change these configuration of scope formats > when submitting the flink job. > > Scope formats as a whole are quite a power-user feature, but that > doesn't mean we should ignore it. > > > I try to let it extend the ComponentMetricGroup > > This isn't inherently required just because it is a "component". > Component metric groups should _only_ be used for cases where such a > component provides several bits of metadata. > For example, tasks provide vertex ids, task names, attempted IDs etc., > and we'd like users to have the option on how this metadata ends up > being used (via scope formats). This currently can't be built with the > addGroup() methods, hence why the component groups exist. > However, the operator coordinator _itself_does not provide several bits > of metadata. Logically, the _operator_ does, not the coordinator. > > > This make me decide to add a new scope format. > > And this is fine; just don't add a new scope format for the coordinator > but logically for operators on the JM side, such that we can extend the > set of operator-specific metrics in the future without having to add yet > another scope format. > > > The relationship between the Operator and OperatorCoordinator is > maintained in the OperatorCoordinatorHolder. In the POC, the operator > name/id could be found and the OperatorCoordinatorMetricGroup will be > created here. > > That's all irrelevant. > > > The registered metrics of OperatorCoordinator always are those which > can not be aggregated from the tasks, like the number of unassigned > splits in the SourceCoordinator. Actually we have not encountered the > scenario you mentioned. I think the OperatorCoordinatorMetricGroup > should only contain the metrics for itself instead of its subtasks. > > You are misunderstanding the problem. > > This isn't about aggregating metrics on our side or exposing metrics > from TMs on the JM side or anything like that. > > It's purely about the available metadata for operator metrics on the JM > side. Currently you suggest operator name / id; what I'm proposing is to > add task(==vertex!) id / name to this list. > > The benefit here is simple. I can look up all metrics where task_id==XXX > and get _everything_ related to that vertex, including all metrics > associated with operators that are part of that vertex, including the > coordinator. > > > Using metrics.scope.jm.job is not enough to distinguish the different > coordinators. > > I did not suggest just using metrics.scope.jm.job. I suggested that as > the default for the jm operator scope format, that is used for the > proposed JobManagerOperatorMetricGroup, which will have some plain > metric groups as children for coordinators. > Aka, you have the JMOperatorMG, then you call addGroup("coordinators") > and then addGroup(coordinator_name) for each coordinator. > > > So there are two choice for the > InternalOperatorCoordinatorMetricGroup: 1. add and improve the new scope > format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one > is better? > > There are more options than that. > I feel like there are a lot of misunderstandings in this discussion. > Please let me know if I could clear things up. If not I can also provide > a PoC based on your PoC if that can speed things up. It's not that > different anyway. > > On 19/01/2023 07:39, Hang Ruan wrote: > > Hi, chesnay, > > > > Thanks for your reply. I still have some doubts about the questions > > you raised. > > > > > Extending the set of ScopeFormats is problematic because it in > > practice > > it breaks the config if users actively rely on it, since there's now > > another key that they _must_ set for it to be > > consistent/compatible with > > their existing setup. > > Unfortunately due to how powerful scope formats are we can't derive a > > default value that matches their existing setup. > > Hence we should try to do this as rarely as possible. > > > > > > > This FLIP does not adhere to that since it proposes a dedicated > > format > > for coordinators; next time we want to expose operator-specific > > metrics > > (e.g., in the scheduler) we'd have to add another one to support it. > > > > > > First, I do not understand why users have to configure the new scope > > format, which has a default value. I usually do not need to change > > these configuration of scope formats when submitting the flink job. > > IMO, the OperatorCoordinator is a component running at the JobMaster. > > I try to let it extend the ComponentMetricGroup, but I cannot find > > a suitable scope. This make me decide to add a new scope format. > > > > > Additionally, the configurable variables (operator name/id) are > > logically not attached to the coordinator, but operators, so to me it > > just doesn't make sense to structure it like this. > > > > > > The relationship between the Operator and OperatorCoordinator is > > maintained in the OperatorCoordinatorHolder. In the POC, the operator > > name/id could be found and the OperatorCoordinatorMetricGroup will be > > created here. > > > > > Another thing I'm concerned about is that, because we don't include > > tasks in the hierarchy, users wishing to collect all metrics for a > > particular task (in this case ==vertex) now have to go > > significantly out > > of their way to get them, since they can no longer just filter by the > > task ID but have to be filter for _all_ operators that are part of > > the task. > > > > > > The registered metrics of OperatorCoordinator always are those which > > can not be aggregated from the tasks, like the number of unassigned > > splits in the SourceCoordinator. Actually we have not encountered the > > scenario you mentioned. I think the OperatorCoordinatorMetricGroup > > should only contain the metrics for itself instead of its subtasks. > > > > Using metrics.scope.jm.job is not enough to distinguish the different > > coordinators. The operator id/name is necessary. Then > > the implementation should be like this. But users can not change the > > scope in this way. This is also acceptable. > > @Internal public class InternalOperatorCoordinatorMetricGroup > > extends ProxyMetricGroup<MetricGroup> > > implements OperatorCoordinatorMetricGroup { > > public InternalOperatorCoordinatorMetricGroup( > > JobManagerJobMetricGroup parent, OperatorID operatorID, > String operatorName) { > > super(parent.addGroup(operatorID + > operatorName).addGroup("coordinator")); } > > } > > So there are two choice for the > > InternalOperatorCoordinatorMetricGroup: 1. add and improve the new > > scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. > > Which one is better? > > > > Best, > > Hang > > > > > > Chesnay Schepler <ches...@apache.org> 于2023年1月18日周三 17:03写道: > > > > You're misunderstanding the problem. > > > > Metric groups form a tree with each group providing certain metadata. > > > > E.g., on the taskmanager we have a TM metric group that provides info > > about the TM, that has child task metric groups, that have child > > operator metric groups etc. The operator metric group again has > > children, where we are now mostly in the user/connector land, like a > > kafka metric group providing partition info or something. > > > > What is being proposed is to create a coordinator metric group on > > the JM > > side that provides operator metadata. > > > > A more appropriate structure is to create an operator group on the JM > > side that provides this info, with the coordinator metric group > > being a > > child. > > > > On 17/01/2023 19:56, Steven Wu wrote: > > >> Additionally, the configurable variables (operator name/id) are > > > logically not attached to the coordinator, but operators, so to > > me it > > > just doesn't make sense to structure it like this. > > > > > > Chesnay, maybe we should clarify the terminology. To me, > > pperators (like > > > FLIP-27 source) can have two parts (coordinator and > > reader/subtask). I > > > think it is fine to include operator name/id for coordinator > > metrics. > > > > > > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler > > <ches...@apache.org> wrote: > > > > > >> Slight correction: Using metrics.scope.jm.job as the default > > should be > > >> safe. > > >> > > >> On 16/01/2023 10:18, Chesnay Schepler wrote: > > >>> The proposed ScopeFormat is still problematic for a few reasons. > > >>> > > >>> Extending the set of ScopeFormats is problematic because it in > > >>> practice it breaks the config if users actively rely on it, since > > >>> there's now another key that they _must_ set for it to be > > >>> consistent/compatible with their existing setup. > > >>> Unfortunately due to how powerful scope formats are we can't > > derive a > > >>> default value that matches their existing setup. > > >>> Hence we should try to do this as rarely as possible. > > >>> > > >>> This FLIP does not adhere to that since it proposes a > > dedicated format > > >>> for coordinators; next time we want to expose operator-specific > > >>> metrics (e.g., in the scheduler) we'd have to add another one to > > >>> support it. > > >>> > > >>> Additionally, the configurable variables (operator name/id) are > > >>> logically not attached to the coordinator, but operators, so > > to me it > > >>> just doesn't make sense to structure it like this. > > >>> > > >>> Another thing I'm concerned about is that, because we don't > > include > > >>> tasks in the hierarchy, users wishing to collect all metrics for > a > > >>> particular task (in this case ==vertex) now have to go > > significantly > > >>> out of their way to get them, since they can no longer just > > filter by > > >>> the task ID but have to be filter for _all_ operators that are > > part of > > >>> the task. > > >>> > > >>> On 16/01/2023 03:09, Hang Ruan wrote: > > >>>> Hi, @ches...@apache.org <ches...@apache.org> , > > >>>> > > >>>> Do you have time to help to review this FLIP again after the > > >>>> modification? > > >>>> Looking forward to your reply. > > >>>> This FLIP will add a new configuration for the > > >>>> OperatorCoordinatorMetricGroup scope format. It provides an > > internal > > >>>> implementation and is added as a component to the > > >>>> JobManagerJobMetricGroup. > > >>>> If something doesn't make sense, could you provide some > > advice? It > > >>>> will be > > >>>> very helpful. Thanks a lot for your help. > > >>>> > > >>>> Best, > > >>>> Hang > > >>>> > > >>>> Martijn Visser <martijnvis...@apache.org> 于2023年1月11日周三 > > 16:34写道: > > >>>> > > >>>> > > >>>>> Hi Hang, > > >>>>> > > >>>>> I'm a bit surprised that this has gone to a vote, given that > > Chesnay > > >>>>> deliberately mentioned that he would vote against it as-is. > > I would > > >>>>> expect > > >>>>> that before going to a vote, he has had the opportunity to > > >>>>> participate in > > >>>>> this discussion. > > >>>>> > > >>>>> Best regards, > > >>>>> > > >>>>> Martijn > > >>>>> > > >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <imj...@gmail.com> > > wrote: > > >>>>> > > >>>>>> Hi Dong, > > >>>>>> > > >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only > > concern is > > >>>>>> that > > >>>>>> this is a core interface for the FLIP. It’s hard to tell how > > >>>>>> sources use > > >>>>>> metric group without mentioning this interface. Even if > > this is an > > >>>>> existing > > >>>>>> API, I think it’s worth introducing the interface again and > > declaring > > >>>>> that > > >>>>>> we will implement the interface instead of a no-op method > > in this > > >>>>>> FLIP. > > >>>>>> > > >>>>>> Anyway, this is a minor problem and shouldn’t block this > > FLIP. I’m > > >>>>>> +1 to > > >>>>>> start a vote. > > >>>>>> > > >>>>>> Best, > > >>>>>> Jark > > >>>>>> > > >>>>>> > > >>>>>>> 2023年1月3日 10:03,Hang Ruan <ruanhang1...@gmail.com> 写道: > > >>>>>>> > > >>>>>>> Hi, Jark and Dong, > > >>>>>>> > > >>>>>>> Thanks for your comments. Sorry for my late reply. > > >>>>>>> > > >>>>>>> For suggestion 1, I plan to implement the > > >>>>>>> SplitEnumeratorMetricGroup in > > >>>>>>> another issue, and it is not contained in this FLIP. I > > will add some > > >>>>>>> description about this part. > > >>>>>>> For suggestion 2, changes about > > OperatorCoordinator#metricGroup has > > >>>>>> already > > >>>>>>> been documented in the proposed change section. > > >>>>>>> > > >>>>>>> Best, > > >>>>>>> Hang > > >>>>>>> > > >>>>>>> Dong Lin <lindon...@gmail.com> 于2023年1月1日周日 09:45写道: > > >>>>>>> > > >>>>>>>> Let me chime-in and add comments regarding the public > > interface > > >>>>> section. > > >>>>>>>> Please see my comments inline. > > >>>>>>>> > > >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu > > <imj...@gmail.com> wrote: > > >>>>>>>> > > >>>>>>>>> Hi Hang, > > >>>>>>>>> > > >>>>>>>>> Thanks for driving this discussion. I think this is a > > very useful > > >>>>>> feature > > >>>>>>>>> for connectors. > > >>>>>>>>> > > >>>>>>>>> The FLIP looks quite good to me, and I just have two > > suggestions. > > >>>>>>>>> > > >>>>>>>>> 1. In the "Public Interface" section, mention that the > > >>>>>>>>> implementation > > >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is > > changed from > > >>>>>>>> returning > > >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup > > instance. > > >>>>> Even > > >>>>>>>>> though the API is already there, the behavior change can > > also be > > >>>>>>>> considered > > >>>>>>>>> a public change. > > >>>>>>>>> > > >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and > > this FLIP > > >>>>>>>> does > > >>>>>> not > > >>>>>>>> seem to change its semantics/behavior. The FLIP does > > change the > > >>>>>>>> implementation/behavior of > > SourceCoordinatorContext#metricGroup, > > >>>>>>>> which > > >>>>>> is > > >>>>>>>> marked @Internal. > > >>>>>>>> > > >>>>>>>> Thus it might seem a bit weird to add in the public > interface > > >>>>>>>> section > > >>>>>>>> saying that we change the interface > > >>>>>>>> SplitEnumeratorContext#metricGroup > > >>>>>> from > > >>>>>>>> returning null to non-null object. > > >>>>>>>> > > >>>>>>>> 2. Mention the newly added interface of > > >>>>>> "OperatorCoordinator#metricGroup" > > >>>>>>>>> in the "Proposed Changes" section or "Public Interface" > > section. As > > >>>>> the > > >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many > > connectors. > > >>>>>> Though > > >>>>>>>> it > > >>>>>>>>> is still an @Internal API, I think it is worth > > mentioning the > > >>>>>>>>> change > > >>>>> in > > >>>>>>>> the > > >>>>>>>>> FLIP. > > >>>>>>>> Since OperatorCoordinator is an internal API, it seems > > reasonable to > > >>>>>>>> explain it in the proposed change section. The FLIP seems > > to have > > >>>>>>>> documented this in point 5 of the proposed change section. > > >>>>>>>> > > >>>>>>>> BTW, if we think there are @internal classes that are > > important > > >>>>>>>> enough > > >>>>>> to > > >>>>>>>> be added in the public interface section, it might be > > useful to > > >>>>>> explicitly > > >>>>>>>> discuss this topic and document it in the "*What are the > > "public > > >>>>>>>> interfaces" of the project*" in this > > >>>>>>>> < > > >>>>>>>> > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals > > >>>>>>>> wiki. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>>> Best, > > >>>>>>>>> Jark > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan > > <ruanhang1...@gmail.com> > > >>>>>> wrote: > > >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng. > > >>>>>>>>>> After combining everyone's comments, the main concerns and > > >>>>>>>> corresponding > > >>>>>>>>>> adjustments are as follows. > > >>>>>>>>>> > > >>>>>>>>>> Q1: Common metrics are not quite useful. > > >>>>>>>>>> numEventsIn and numEventsOut counters will be removed > > from the > > >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do > not > > >>>>>>>>>> provide > > >>>>>>>>> enough > > >>>>>>>>>> information for users. The users are more willing to > > get the > > >>>>>>>>>> number > > >>>>> of > > >>>>>>>>>> events of the specified type instead of the total > > number. And this > > >>>>>>>> metric > > >>>>>>>>>> is calculated differently. The implementation could > > register the > > >>>>>> metric > > >>>>>>>>> by > > >>>>>>>>>> themselves. > > >>>>>>>>>> > > >>>>>>>>>> Q2: This FLIP is overly complicated. > > >>>>>>>>>> This FLIP will become concise after these modifications. > > >>>>>>>>>> OperatorCoordinatorMetricGroup has already been > > introduced into > > >>>>> Flink > > >>>>>>>> by > > >>>>>>>>>> FLIP-179< > > >>>>>>>>>> > > >>>>>>>>>> > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics > > >>>>>>>>>>> . > > >>>>>>>>>> And > > >>>>>>>>>> this FLIP will not change it. This FLIP only provides a > > new metric > > >>>>>>>> option > > >>>>>>>>>> and a new metric group scope. The changes in proposed > > changes > > >>>>> provide > > >>>>>>>> the > > >>>>>>>>>> details about the modifications for the internal > > classes, which > > >>>>> might > > >>>>>>>>> make > > >>>>>>>>>> it look complicated. > > >>>>>>>>>> > > >>>>>>>>>> Thanks for all the comments again. If there are no further > > >>>>>>>>>> comments, > > >>>>>> we > > >>>>>>>>>> plan to start the voting thread this week. > > >>>>>>>>>> > > >>>>>>>>>> Best, > > >>>>>>>>>> Hang > > >>>>>>>>>> > > >>>>>>>>>> Qingsheng Ren <renqs...@gmail.com> 于2022年12月26日周一 > > 16:48写道: > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> Thanks for the FLIP, Hang! > > >>>>>>>>>>> > > >>>>>>>>>>> This FLIP overall looks good to me. Actually I share > > the same > > >>>>> concern > > >>>>>>>>>> with > > >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not > > quite > > >>>>>>>>>>> useful > > >>>>>>>> to > > >>>>>>>>>> end > > >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction, > > which > > >>>>> requires > > >>>>>>>>>>> instantialization in order to be practical to users and > > >>>>>>>>>>> developers, > > >>>>>>>> so > > >>>>>>>>>>> maybe it's better to exclude them from the FLIP. > > >>>>>>>>>>> > > >>>>>>>>>>> Best, > > >>>>>>>>>>> Qingsheng > > >>>>>>>>>>> Ververica (Alibaba) > > >>>>>>>>>>> > > >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu > > <reed...@gmail.com> > > >>>>> wrote: > > >>>>>>>>>>>> Hi Hang, > > >>>>>>>>>>>> I still see no strong reason why we need > > >>>>>>>>>>>> numEventsIn/numEventsOut > > >>>>>>>>>>> metrics. > > >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same > > concern > > >>>>>>>>>>>> from > > >>>>>>>>>> others. > > >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid > > >>>>>>>>>>>> over-extending > > >>>>>>>>> the > > >>>>>>>>>>>> scope. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thanks, > > >>>>>>>>>>>> Zhu > > >>>>>>>>>>>> > > >>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月23日周五 > > >>>>>>>>>>>> 15:21写道: > > >>>>>>>>>>>>> Hi everyone, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep > > the common > > >>>>>>>>>>> metrics, > > >>>>>>>>>>>> which will help to observe incoming and outgoing > > events. What do > > >>>>>>>> you > > >>>>>>>>>>> think, > > >>>>>>>>>>>> @Zhu Zhu ? > > >>>>>>>>>>>>> And @Chesnay, are there any other issues you are > > more concerned > > >>>>>>>>>> about? > > >>>>>>>>>>>> Looking forward to your reply. > > >>>>>>>>>>>>> Thanks for all the comments. If there are no further > > >>>>>>>>>>>>> comments, we > > >>>>>>>>>> plan > > >>>>>>>>>>>> to start the voting thread next week. > > >>>>>>>>>>>>> Best, > > >>>>>>>>>>>>> Hang > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月15日周四 > > >>>>>>>>>>>>> 16:49写道: > > >>>>>>>>>>>>>> Hi, Zhu Zhu, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Thanks for your feedback! > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> The OperatorCoordinator implementations are > > different. And > > >>>>>>>>>>>>>> their > > >>>>>>>>>>>> metrics are much different too. We try to find the > common > > >>>>>>>>>>>> metrics > > >>>>>>>> and > > >>>>>>>>>> put > > >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most > > developers > > >>>>>>>> think > > >>>>>>>>> we > > >>>>>>>>>>> do > > >>>>>>>>>>>> not need these common metrics, removing the common > > metrics is > > >>>>>>>>>>>> also > > >>>>>>>>>>>> acceptable. > > >>>>>>>>>>>>>> Best, > > >>>>>>>>>>>>>> Hang > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com> 于2022年12月14日周三 > > 22:09写道: > > >>>>>>>>>>>>>>> Hi Hang & MengYue, > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Thanks for creating this FLIP! > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects: > > >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics. > > >>>>>>>>>>>>>>> Currently > > >>>>>>>>>>>>>>> the coordinators has no way to do this. And operator > > >>>>>>>> coordinator > > >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator > > to have > > >>>>>>>>>>>>>>> access > > >>>>>>>>>>>>>>> to a registered metric group (via the existing public > > >>>>>>>>>>>>>>> interface > > >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is > > null at the > > >>>>>>>>> moment. > > >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator > > metrics. A clear > > >>>>>>>>>>>> definition > > >>>>>>>>>>>>>>> makes it easy for users to find their wanted > > metrics. The > > >>>>>>>>>> definition > > >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from > multiple > > >>>>>>>>>>>> OperatorCoordinators > > >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may > > have its > > >>>>>>>>>>>>>>> own > > >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not > > be directly > > >>>>>>>>>>> registered > > >>>>>>>>>>>>>>> to the job metric group. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the > > introduced > > >>>>>>>>> common > > >>>>>>>>>>>> metrics > > >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there > > any case > > >>>>>>>> which > > >>>>>>>>>>>> strongly > > >>>>>>>>>>>>>>> requires them? > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Regarding the concerns of Chesnay, > > >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is > overkill > > >>>>>>>>>>>>>>> Directly using the job metric group can result in > > metric > > >>>>>>>>> conflicts, > > >>>>>>>>>>> as > > >>>>>>>>>>>> mentioned > > >>>>>>>>>>>>>>> in above #2. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Thanks, > > >>>>>>>>>>>>>>> Zhu > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Dong Lin <lindon...@gmail.com> 于2022年12月10日周六 > > >>>>>>>>>>>>>>> 14:16写道: > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Hi Chesney, > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Just to double check with you, > > >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup > > >>>>>>>>>>>> (annotated as > > >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into > > Flink by > > >>>>>>>>>> FLIP-179 > > >>>>>>>>>>>>>>>> < > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics > > >>>>>>>>>>>>> . > > >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we > > should remove > > >>>>>>>>> this > > >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup? > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Regards, > > >>>>>>>>>>>>>>>> Dong > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler < > > >>>>>>>>>>> ches...@apache.org> > > >>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly > > complicated. A > > >>>>>>>>>>> dedicated > > >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it > > could just > > >>>>>>>>> re-use > > >>>>>>>>>>> the > > >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same > > structure we > > >>>>>>>>> have > > >>>>>>>>>> on > > >>>>>>>>>>>> TMs, > > >>>>>>>>>>>>>>>>> similar to what we did with the Job MG. > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required > > anyway, > > >>>>>>>>>> because > > >>>>>>>>>>>> all the > > >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on > > the TM > > >>>>>>>> side > > >>>>>>>>> + > > >>>>>>>>>>>>>>>>> aggregating them in the external metrics backend. > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries > > to pull a > > >>>>>>>>> fast > > >>>>>>>>>>>> one on > > >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be > > against it. > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote: > > >>>>>>>>>>>>>>>>>> Hi Hang, > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it > > is pretty > > >>>>>>>>>>>> informative. > > >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names: > > >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as > > >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for > > >>>>>>>> consistency > > >>>>>>>>>>> with > > >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job > > >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)? > > >>>>>>>>>>>>>>>>>> - Maybe rename the variable as > > >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR > > >>>>>>>>>>>> for > > >>>>>>>>>>>>>>>>>> simplicity and consistency with > > SCOPE_NAMING_OPERATOR > > >>>>>>>>> (which > > >>>>>>>>>> is > > >>>>>>>>>>>> not named > > >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)? > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> Cheers, > > >>>>>>>>>>>>>>>>>> Dong > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan < > > >>>>>>>>>>> ruanhang1...@gmail.com> > > >>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>>>>> Hi all, > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce > > metric group > > >>>>>>>>> for > > >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the > > >>>>>>>>> coordinator > > >>>>>>>>>>> for > > >>>>>>>>>>>> runtime > > >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The > > coordination > > >>>>>>>>>>>> mechanism is > > >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator > > and its all > > >>>>>>>>>>>> operators, the > > >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for > > >>>>>>>> example > > >>>>>>>>>> many > > >>>>>>>>>>>> Sources > > >>>>>>>>>>>>>>>>> and > > >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits > and > > >>>>>>>>>> coordinate > > >>>>>>>>>>>> commits to > > >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is > > widely > > >>>>>>>> using > > >>>>>>>>> in > > >>>>>>>>>>>> flink kafka > > >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc > > connector, > > >>>>>>>>>> flink > > >>>>>>>>>>>> hudi > > >>>>>>>>>>>>>>>>>>> connector and so on. > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope > > for the > > >>>>>>>>>>>>>>>>> OperatorCoordinator > > >>>>>>>>>>>>>>>>>>> and not an implementation for the interface > > >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup. > > >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be > > how many > > >>>>>>>>>>>> splits/partitions > > >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many > > files > > >>>>>>>> have > > >>>>>>>>>> been > > >>>>>>>>>>>> written > > >>>>>>>>>>>>>>>>> out > > >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help > > users to > > >>>>>>>> know > > >>>>>>>>>> the > > >>>>>>>>>>>> job > > >>>>>>>>>>>>>>>>> progress > > >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we > > >>>>>>>> propose > > >>>>>>>>>> the > > >>>>>>>>>>>> FLIP-274 > > >>>>>>>>>>>>>>>>> to > > >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for > > >>>>>>>> OperatorCoordinator > > >>>>>>>>>> and > > >>>>>>>>>>>> provide > > >>>>>>>>>>>>>>>>> an > > >>>>>>>>>>>>>>>>>>> internal implementation for > > >>>>>>>>> OperatorCoordinatorMetricGroup. > > >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get > > time? Any > > >>>>>>>>>>> feedback > > >>>>>>>>>>>> is > > >>>>>>>>>>>>>>>>>>> appreciated! > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Best, > > >>>>>>>>>>>>>>>>>>> Hang > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> [1] > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator > > >> > > >