Hi all, Thanks for all the comments. Sorry for my late reply.
I think I understand most of the suggestions. I still have the same question about the adding task id/name like Jark. And I have refactored the POC like this ( https://github.com/ruanhang1993/flink/commit/321192d17afbb9d00285a78a121676ae5e371292), whose scope has the task name. When we initialize the coordinator, it is hard to get the task name like we do in the TaskMetricGroup. I also found a weird thing that ExecutionJobVertex#getTaskInformationOrBlobKey will use the jobVertex.getName() as the task name. But ExecutionJobVertex#createOperatorCoordinatorHolder also will use the jobVertex.getName() as the operatorName for LazyInitializedCoordinatorContext. Maybe the task name will be duplicated to the operator name in the metric name. Best, Hang Jark Wu <imj...@gmail.com> 于2023年1月28日周六 16:15写道: > Hi all, > > IIUC, Chesnay means we should have a more general metric group for > **operator** > not for **coordinator** in JM, this would be useful to extend > other operator-specific > metrics in the future. That means the new scope format should be designed > for > the operator, > e.g., metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name> > The coordinator metric is a subgroup (a constant "coordinator" suffix) of > the JMOperatorMG. > > I think this is a nice design. However, I have a question about adding > task id/name to this list. > How to get the task id/name when reporting a coordinator metric? As we > know, coordinator metrics > are just like JMJobMetricGroup, which don't belong to any tasks/vertexes. > Do you mean reporting > coordinator metrics for every task id under the operator? > > Best, > Jark > > > On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ches...@apache.org> wrote: > >> > First, I do not understand why users have to configure the new scope >> format, which has a default value. >> >> If you don't use scope formats, sure. If you do use scope formats, e.g. >> to add a common prefix (which is the case for datadog users for >> example), then the current default in the FLIP is insufficient and >> requires the user to update the configuration. >> >> > I usually do not need to change these configuration of scope formats >> when submitting the flink job. >> >> Scope formats as a whole are quite a power-user feature, but that >> doesn't mean we should ignore it. >> >> > I try to let it extend the ComponentMetricGroup >> >> This isn't inherently required just because it is a "component". >> Component metric groups should _only_ be used for cases where such a >> component provides several bits of metadata. >> For example, tasks provide vertex ids, task names, attempted IDs etc., >> and we'd like users to have the option on how this metadata ends up >> being used (via scope formats). This currently can't be built with the >> addGroup() methods, hence why the component groups exist. >> However, the operator coordinator _itself_does not provide several bits >> of metadata. Logically, the _operator_ does, not the coordinator. >> >> > This make me decide to add a new scope format. >> >> And this is fine; just don't add a new scope format for the coordinator >> but logically for operators on the JM side, such that we can extend the >> set of operator-specific metrics in the future without having to add yet >> another scope format. >> >> > The relationship between the Operator and OperatorCoordinator is >> maintained in the OperatorCoordinatorHolder. In the POC, the operator >> name/id could be found and the OperatorCoordinatorMetricGroup will be >> created here. >> >> That's all irrelevant. >> >> > The registered metrics of OperatorCoordinator always are those which >> can not be aggregated from the tasks, like the number of unassigned >> splits in the SourceCoordinator. Actually we have not encountered the >> scenario you mentioned. I think the OperatorCoordinatorMetricGroup >> should only contain the metrics for itself instead of its subtasks. >> >> You are misunderstanding the problem. >> >> This isn't about aggregating metrics on our side or exposing metrics >> from TMs on the JM side or anything like that. >> >> It's purely about the available metadata for operator metrics on the JM >> side. Currently you suggest operator name / id; what I'm proposing is to >> add task(==vertex!) id / name to this list. >> >> The benefit here is simple. I can look up all metrics where task_id==XXX >> and get _everything_ related to that vertex, including all metrics >> associated with operators that are part of that vertex, including the >> coordinator. >> >> > Using metrics.scope.jm.job is not enough to distinguish the different >> coordinators. >> >> I did not suggest just using metrics.scope.jm.job. I suggested that as >> the default for the jm operator scope format, that is used for the >> proposed JobManagerOperatorMetricGroup, which will have some plain >> metric groups as children for coordinators. >> Aka, you have the JMOperatorMG, then you call addGroup("coordinators") >> and then addGroup(coordinator_name) for each coordinator. >> >> > So there are two choice for the >> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new scope >> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one >> is better? >> >> There are more options than that. >> I feel like there are a lot of misunderstandings in this discussion. >> Please let me know if I could clear things up. If not I can also provide >> a PoC based on your PoC if that can speed things up. It's not that >> different anyway. >> >> On 19/01/2023 07:39, Hang Ruan wrote: >> > Hi, chesnay, >> > >> > Thanks for your reply. I still have some doubts about the questions >> > you raised. >> > >> > > Extending the set of ScopeFormats is problematic because it in >> > practice >> > it breaks the config if users actively rely on it, since there's now >> > another key that they _must_ set for it to be >> > consistent/compatible with >> > their existing setup. >> > Unfortunately due to how powerful scope formats are we can't derive >> a >> > default value that matches their existing setup. >> > Hence we should try to do this as rarely as possible. >> > >> > >> > > This FLIP does not adhere to that since it proposes a dedicated >> > format >> > for coordinators; next time we want to expose operator-specific >> > metrics >> > (e.g., in the scheduler) we'd have to add another one to support it. >> > >> > >> > First, I do not understand why users have to configure the new scope >> > format, which has a default value. I usually do not need to change >> > these configuration of scope formats when submitting the flink job. >> > IMO, the OperatorCoordinator is a component running at the JobMaster. >> > I try to let it extend the ComponentMetricGroup, but I cannot find >> > a suitable scope. This make me decide to add a new scope format. >> > >> > > Additionally, the configurable variables (operator name/id) are >> > logically not attached to the coordinator, but operators, so to me >> it >> > just doesn't make sense to structure it like this. >> > >> > >> > The relationship between the Operator and OperatorCoordinator is >> > maintained in the OperatorCoordinatorHolder. In the POC, the operator >> > name/id could be found and the OperatorCoordinatorMetricGroup will be >> > created here. >> > >> > > Another thing I'm concerned about is that, because we don't >> include >> > tasks in the hierarchy, users wishing to collect all metrics for a >> > particular task (in this case ==vertex) now have to go >> > significantly out >> > of their way to get them, since they can no longer just filter by >> the >> > task ID but have to be filter for _all_ operators that are part of >> > the task. >> > >> > >> > The registered metrics of OperatorCoordinator always are those which >> > can not be aggregated from the tasks, like the number of unassigned >> > splits in the SourceCoordinator. Actually we have not encountered the >> > scenario you mentioned. I think the OperatorCoordinatorMetricGroup >> > should only contain the metrics for itself instead of its subtasks. >> > >> > Using metrics.scope.jm.job is not enough to distinguish the different >> > coordinators. The operator id/name is necessary. Then >> > the implementation should be like this. But users can not change the >> > scope in this way. This is also acceptable. >> > @Internal public class InternalOperatorCoordinatorMetricGroup >> > extends ProxyMetricGroup<MetricGroup> >> > implements OperatorCoordinatorMetricGroup { >> > public InternalOperatorCoordinatorMetricGroup( >> > JobManagerJobMetricGroup parent, OperatorID operatorID, >> String operatorName) { >> > super(parent.addGroup(operatorID + >> operatorName).addGroup("coordinator")); } >> > } >> > So there are two choice for the >> > InternalOperatorCoordinatorMetricGroup: 1. add and improve the new >> > scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. >> > Which one is better? >> > >> > Best, >> > Hang >> > >> > >> > Chesnay Schepler <ches...@apache.org> 于2023年1月18日周三 17:03写道: >> > >> > You're misunderstanding the problem. >> > >> > Metric groups form a tree with each group providing certain >> metadata. >> > >> > E.g., on the taskmanager we have a TM metric group that provides >> info >> > about the TM, that has child task metric groups, that have child >> > operator metric groups etc. The operator metric group again has >> > children, where we are now mostly in the user/connector land, like a >> > kafka metric group providing partition info or something. >> > >> > What is being proposed is to create a coordinator metric group on >> > the JM >> > side that provides operator metadata. >> > >> > A more appropriate structure is to create an operator group on the >> JM >> > side that provides this info, with the coordinator metric group >> > being a >> > child. >> > >> > On 17/01/2023 19:56, Steven Wu wrote: >> > >> Additionally, the configurable variables (operator name/id) are >> > > logically not attached to the coordinator, but operators, so to >> > me it >> > > just doesn't make sense to structure it like this. >> > > >> > > Chesnay, maybe we should clarify the terminology. To me, >> > pperators (like >> > > FLIP-27 source) can have two parts (coordinator and >> > reader/subtask). I >> > > think it is fine to include operator name/id for coordinator >> > metrics. >> > > >> > > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler >> > <ches...@apache.org> wrote: >> > > >> > >> Slight correction: Using metrics.scope.jm.job as the default >> > should be >> > >> safe. >> > >> >> > >> On 16/01/2023 10:18, Chesnay Schepler wrote: >> > >>> The proposed ScopeFormat is still problematic for a few reasons. >> > >>> >> > >>> Extending the set of ScopeFormats is problematic because it in >> > >>> practice it breaks the config if users actively rely on it, >> since >> > >>> there's now another key that they _must_ set for it to be >> > >>> consistent/compatible with their existing setup. >> > >>> Unfortunately due to how powerful scope formats are we can't >> > derive a >> > >>> default value that matches their existing setup. >> > >>> Hence we should try to do this as rarely as possible. >> > >>> >> > >>> This FLIP does not adhere to that since it proposes a >> > dedicated format >> > >>> for coordinators; next time we want to expose operator-specific >> > >>> metrics (e.g., in the scheduler) we'd have to add another one to >> > >>> support it. >> > >>> >> > >>> Additionally, the configurable variables (operator name/id) are >> > >>> logically not attached to the coordinator, but operators, so >> > to me it >> > >>> just doesn't make sense to structure it like this. >> > >>> >> > >>> Another thing I'm concerned about is that, because we don't >> > include >> > >>> tasks in the hierarchy, users wishing to collect all metrics >> for a >> > >>> particular task (in this case ==vertex) now have to go >> > significantly >> > >>> out of their way to get them, since they can no longer just >> > filter by >> > >>> the task ID but have to be filter for _all_ operators that are >> > part of >> > >>> the task. >> > >>> >> > >>> On 16/01/2023 03:09, Hang Ruan wrote: >> > >>>> Hi, @ches...@apache.org <ches...@apache.org> , >> > >>>> >> > >>>> Do you have time to help to review this FLIP again after the >> > >>>> modification? >> > >>>> Looking forward to your reply. >> > >>>> This FLIP will add a new configuration for the >> > >>>> OperatorCoordinatorMetricGroup scope format. It provides an >> > internal >> > >>>> implementation and is added as a component to the >> > >>>> JobManagerJobMetricGroup. >> > >>>> If something doesn't make sense, could you provide some >> > advice? It >> > >>>> will be >> > >>>> very helpful. Thanks a lot for your help. >> > >>>> >> > >>>> Best, >> > >>>> Hang >> > >>>> >> > >>>> Martijn Visser <martijnvis...@apache.org> 于2023年1月11日周三 >> > 16:34写道: >> > >>>> >> > >>>> >> > >>>>> Hi Hang, >> > >>>>> >> > >>>>> I'm a bit surprised that this has gone to a vote, given that >> > Chesnay >> > >>>>> deliberately mentioned that he would vote against it as-is. >> > I would >> > >>>>> expect >> > >>>>> that before going to a vote, he has had the opportunity to >> > >>>>> participate in >> > >>>>> this discussion. >> > >>>>> >> > >>>>> Best regards, >> > >>>>> >> > >>>>> Martijn >> > >>>>> >> > >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <imj...@gmail.com> >> > wrote: >> > >>>>> >> > >>>>>> Hi Dong, >> > >>>>>> >> > >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only >> > concern is >> > >>>>>> that >> > >>>>>> this is a core interface for the FLIP. It’s hard to tell how >> > >>>>>> sources use >> > >>>>>> metric group without mentioning this interface. Even if >> > this is an >> > >>>>> existing >> > >>>>>> API, I think it’s worth introducing the interface again and >> > declaring >> > >>>>> that >> > >>>>>> we will implement the interface instead of a no-op method >> > in this >> > >>>>>> FLIP. >> > >>>>>> >> > >>>>>> Anyway, this is a minor problem and shouldn’t block this >> > FLIP. I’m >> > >>>>>> +1 to >> > >>>>>> start a vote. >> > >>>>>> >> > >>>>>> Best, >> > >>>>>> Jark >> > >>>>>> >> > >>>>>> >> > >>>>>>> 2023年1月3日 10:03,Hang Ruan <ruanhang1...@gmail.com> 写道: >> > >>>>>>> >> > >>>>>>> Hi, Jark and Dong, >> > >>>>>>> >> > >>>>>>> Thanks for your comments. Sorry for my late reply. >> > >>>>>>> >> > >>>>>>> For suggestion 1, I plan to implement the >> > >>>>>>> SplitEnumeratorMetricGroup in >> > >>>>>>> another issue, and it is not contained in this FLIP. I >> > will add some >> > >>>>>>> description about this part. >> > >>>>>>> For suggestion 2, changes about >> > OperatorCoordinator#metricGroup has >> > >>>>>> already >> > >>>>>>> been documented in the proposed change section. >> > >>>>>>> >> > >>>>>>> Best, >> > >>>>>>> Hang >> > >>>>>>> >> > >>>>>>> Dong Lin <lindon...@gmail.com> 于2023年1月1日周日 09:45写道: >> > >>>>>>> >> > >>>>>>>> Let me chime-in and add comments regarding the public >> > interface >> > >>>>> section. >> > >>>>>>>> Please see my comments inline. >> > >>>>>>>> >> > >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu >> > <imj...@gmail.com> wrote: >> > >>>>>>>> >> > >>>>>>>>> Hi Hang, >> > >>>>>>>>> >> > >>>>>>>>> Thanks for driving this discussion. I think this is a >> > very useful >> > >>>>>> feature >> > >>>>>>>>> for connectors. >> > >>>>>>>>> >> > >>>>>>>>> The FLIP looks quite good to me, and I just have two >> > suggestions. >> > >>>>>>>>> >> > >>>>>>>>> 1. In the "Public Interface" section, mention that the >> > >>>>>>>>> implementation >> > >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is >> > changed from >> > >>>>>>>> returning >> > >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup >> > instance. >> > >>>>> Even >> > >>>>>>>>> though the API is already there, the behavior change can >> > also be >> > >>>>>>>> considered >> > >>>>>>>>> a public change. >> > >>>>>>>>> >> > >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and >> > this FLIP >> > >>>>>>>> does >> > >>>>>> not >> > >>>>>>>> seem to change its semantics/behavior. The FLIP does >> > change the >> > >>>>>>>> implementation/behavior of >> > SourceCoordinatorContext#metricGroup, >> > >>>>>>>> which >> > >>>>>> is >> > >>>>>>>> marked @Internal. >> > >>>>>>>> >> > >>>>>>>> Thus it might seem a bit weird to add in the public >> interface >> > >>>>>>>> section >> > >>>>>>>> saying that we change the interface >> > >>>>>>>> SplitEnumeratorContext#metricGroup >> > >>>>>> from >> > >>>>>>>> returning null to non-null object. >> > >>>>>>>> >> > >>>>>>>> 2. Mention the newly added interface of >> > >>>>>> "OperatorCoordinator#metricGroup" >> > >>>>>>>>> in the "Proposed Changes" section or "Public Interface" >> > section. As >> > >>>>> the >> > >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many >> > connectors. >> > >>>>>> Though >> > >>>>>>>> it >> > >>>>>>>>> is still an @Internal API, I think it is worth >> > mentioning the >> > >>>>>>>>> change >> > >>>>> in >> > >>>>>>>> the >> > >>>>>>>>> FLIP. >> > >>>>>>>> Since OperatorCoordinator is an internal API, it seems >> > reasonable to >> > >>>>>>>> explain it in the proposed change section. The FLIP seems >> > to have >> > >>>>>>>> documented this in point 5 of the proposed change section. >> > >>>>>>>> >> > >>>>>>>> BTW, if we think there are @internal classes that are >> > important >> > >>>>>>>> enough >> > >>>>>> to >> > >>>>>>>> be added in the public interface section, it might be >> > useful to >> > >>>>>> explicitly >> > >>>>>>>> discuss this topic and document it in the "*What are the >> > "public >> > >>>>>>>> interfaces" of the project*" in this >> > >>>>>>>> < >> > >>>>>>>> >> > >> >> > >> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals >> > >>>>>>>> wiki. >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>>> Best, >> > >>>>>>>>> Jark >> > >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan >> > <ruanhang1...@gmail.com> >> > >>>>>> wrote: >> > >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng. >> > >>>>>>>>>> After combining everyone's comments, the main concerns >> and >> > >>>>>>>> corresponding >> > >>>>>>>>>> adjustments are as follows. >> > >>>>>>>>>> >> > >>>>>>>>>> Q1: Common metrics are not quite useful. >> > >>>>>>>>>> numEventsIn and numEventsOut counters will be removed >> > from the >> > >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do >> not >> > >>>>>>>>>> provide >> > >>>>>>>>> enough >> > >>>>>>>>>> information for users. The users are more willing to >> > get the >> > >>>>>>>>>> number >> > >>>>> of >> > >>>>>>>>>> events of the specified type instead of the total >> > number. And this >> > >>>>>>>> metric >> > >>>>>>>>>> is calculated differently. The implementation could >> > register the >> > >>>>>> metric >> > >>>>>>>>> by >> > >>>>>>>>>> themselves. >> > >>>>>>>>>> >> > >>>>>>>>>> Q2: This FLIP is overly complicated. >> > >>>>>>>>>> This FLIP will become concise after these modifications. >> > >>>>>>>>>> OperatorCoordinatorMetricGroup has already been >> > introduced into >> > >>>>> Flink >> > >>>>>>>> by >> > >>>>>>>>>> FLIP-179< >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >> >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics >> > >>>>>>>>>>> . >> > >>>>>>>>>> And >> > >>>>>>>>>> this FLIP will not change it. This FLIP only provides a >> > new metric >> > >>>>>>>> option >> > >>>>>>>>>> and a new metric group scope. The changes in proposed >> > changes >> > >>>>> provide >> > >>>>>>>> the >> > >>>>>>>>>> details about the modifications for the internal >> > classes, which >> > >>>>> might >> > >>>>>>>>> make >> > >>>>>>>>>> it look complicated. >> > >>>>>>>>>> >> > >>>>>>>>>> Thanks for all the comments again. If there are no >> further >> > >>>>>>>>>> comments, >> > >>>>>> we >> > >>>>>>>>>> plan to start the voting thread this week. >> > >>>>>>>>>> >> > >>>>>>>>>> Best, >> > >>>>>>>>>> Hang >> > >>>>>>>>>> >> > >>>>>>>>>> Qingsheng Ren <renqs...@gmail.com> 于2022年12月26日周一 >> > 16:48写道: >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>>> Thanks for the FLIP, Hang! >> > >>>>>>>>>>> >> > >>>>>>>>>>> This FLIP overall looks good to me. Actually I share >> > the same >> > >>>>> concern >> > >>>>>>>>>> with >> > >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not >> > quite >> > >>>>>>>>>>> useful >> > >>>>>>>> to >> > >>>>>>>>>> end >> > >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction, >> > which >> > >>>>> requires >> > >>>>>>>>>>> instantialization in order to be practical to users and >> > >>>>>>>>>>> developers, >> > >>>>>>>> so >> > >>>>>>>>>>> maybe it's better to exclude them from the FLIP. >> > >>>>>>>>>>> >> > >>>>>>>>>>> Best, >> > >>>>>>>>>>> Qingsheng >> > >>>>>>>>>>> Ververica (Alibaba) >> > >>>>>>>>>>> >> > >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu >> > <reed...@gmail.com> >> > >>>>> wrote: >> > >>>>>>>>>>>> Hi Hang, >> > >>>>>>>>>>>> I still see no strong reason why we need >> > >>>>>>>>>>>> numEventsIn/numEventsOut >> > >>>>>>>>>>> metrics. >> > >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same >> > concern >> > >>>>>>>>>>>> from >> > >>>>>>>>>> others. >> > >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid >> > >>>>>>>>>>>> over-extending >> > >>>>>>>>> the >> > >>>>>>>>>>>> scope. >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Thanks, >> > >>>>>>>>>>>> Zhu >> > >>>>>>>>>>>> >> > >>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月23日周五 >> > >>>>>>>>>>>> 15:21写道: >> > >>>>>>>>>>>>> Hi everyone, >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep >> > the common >> > >>>>>>>>>>> metrics, >> > >>>>>>>>>>>> which will help to observe incoming and outgoing >> > events. What do >> > >>>>>>>> you >> > >>>>>>>>>>> think, >> > >>>>>>>>>>>> @Zhu Zhu ? >> > >>>>>>>>>>>>> And @Chesnay, are there any other issues you are >> > more concerned >> > >>>>>>>>>> about? >> > >>>>>>>>>>>> Looking forward to your reply. >> > >>>>>>>>>>>>> Thanks for all the comments. If there are no further >> > >>>>>>>>>>>>> comments, we >> > >>>>>>>>>> plan >> > >>>>>>>>>>>> to start the voting thread next week. >> > >>>>>>>>>>>>> Best, >> > >>>>>>>>>>>>> Hang >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月15日周四 >> > >>>>>>>>>>>>> 16:49写道: >> > >>>>>>>>>>>>>> Hi, Zhu Zhu, >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Thanks for your feedback! >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> The OperatorCoordinator implementations are >> > different. And >> > >>>>>>>>>>>>>> their >> > >>>>>>>>>>>> metrics are much different too. We try to find the >> common >> > >>>>>>>>>>>> metrics >> > >>>>>>>> and >> > >>>>>>>>>> put >> > >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most >> > developers >> > >>>>>>>> think >> > >>>>>>>>> we >> > >>>>>>>>>>> do >> > >>>>>>>>>>>> not need these common metrics, removing the common >> > metrics is >> > >>>>>>>>>>>> also >> > >>>>>>>>>>>> acceptable. >> > >>>>>>>>>>>>>> Best, >> > >>>>>>>>>>>>>> Hang >> > >>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com> 于2022年12月14日周三 >> > 22:09写道: >> > >>>>>>>>>>>>>>> Hi Hang & MengYue, >> > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> Thanks for creating this FLIP! >> > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects: >> > >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics. >> > >>>>>>>>>>>>>>> Currently >> > >>>>>>>>>>>>>>> the coordinators has no way to do this. And operator >> > >>>>>>>> coordinator >> > >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator >> > to have >> > >>>>>>>>>>>>>>> access >> > >>>>>>>>>>>>>>> to a registered metric group (via the existing >> public >> > >>>>>>>>>>>>>>> interface >> > >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is >> > null at the >> > >>>>>>>>> moment. >> > >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator >> > metrics. A clear >> > >>>>>>>>>>>> definition >> > >>>>>>>>>>>>>>> makes it easy for users to find their wanted >> > metrics. The >> > >>>>>>>>>> definition >> > >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from >> multiple >> > >>>>>>>>>>>> OperatorCoordinators >> > >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may >> > have its >> > >>>>>>>>>>>>>>> own >> > >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not >> > be directly >> > >>>>>>>>>>> registered >> > >>>>>>>>>>>>>>> to the job metric group. >> > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the >> > introduced >> > >>>>>>>>> common >> > >>>>>>>>>>>> metrics >> > >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there >> > any case >> > >>>>>>>> which >> > >>>>>>>>>>>> strongly >> > >>>>>>>>>>>>>>> requires them? >> > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> Regarding the concerns of Chesnay, >> > >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is >> overkill >> > >>>>>>>>>>>>>>> Directly using the job metric group can result in >> > metric >> > >>>>>>>>> conflicts, >> > >>>>>>>>>>> as >> > >>>>>>>>>>>> mentioned >> > >>>>>>>>>>>>>>> in above #2. >> > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> Thanks, >> > >>>>>>>>>>>>>>> Zhu >> > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> Dong Lin <lindon...@gmail.com> 于2022年12月10日周六 >> > >>>>>>>>>>>>>>> 14:16写道: >> > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> Hi Chesney, >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> Just to double check with you, >> > >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup >> > >>>>>>>>>>>> (annotated as >> > >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into >> > Flink by >> > >>>>>>>>>> FLIP-179 >> > >>>>>>>>>>>>>>>> < >> > >> >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics >> > >>>>>>>>>>>>> . >> > >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we >> > should remove >> > >>>>>>>>> this >> > >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup? >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> Regards, >> > >>>>>>>>>>>>>>>> Dong >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler < >> > >>>>>>>>>>> ches...@apache.org> >> > >>>>>>>>>>>> wrote: >> > >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly >> > complicated. A >> > >>>>>>>>>>> dedicated >> > >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it >> > could just >> > >>>>>>>>> re-use >> > >>>>>>>>>>> the >> > >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same >> > structure we >> > >>>>>>>>> have >> > >>>>>>>>>> on >> > >>>>>>>>>>>> TMs, >> > >>>>>>>>>>>>>>>>> similar to what we did with the Job MG. >> > >>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required >> > anyway, >> > >>>>>>>>>> because >> > >>>>>>>>>>>> all the >> > >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on >> > the TM >> > >>>>>>>> side >> > >>>>>>>>> + >> > >>>>>>>>>>>>>>>>> aggregating them in the external metrics backend. >> > >>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries >> > to pull a >> > >>>>>>>>> fast >> > >>>>>>>>>>>> one on >> > >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be >> > against it. >> > >>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote: >> > >>>>>>>>>>>>>>>>>> Hi Hang, >> > >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it >> > is pretty >> > >>>>>>>>>>>> informative. >> > >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names: >> > >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as >> > >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for >> > >>>>>>>> consistency >> > >>>>>>>>>>> with >> > >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job >> > >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)? >> > >>>>>>>>>>>>>>>>>> - Maybe rename the variable as >> > >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR >> > >>>>>>>>>>>> for >> > >>>>>>>>>>>>>>>>>> simplicity and consistency with >> > SCOPE_NAMING_OPERATOR >> > >>>>>>>>> (which >> > >>>>>>>>>> is >> > >>>>>>>>>>>> not named >> > >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)? >> > >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> Cheers, >> > >>>>>>>>>>>>>>>>>> Dong >> > >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan < >> > >>>>>>>>>>> ruanhang1...@gmail.com> >> > >>>>>>>>>>>> wrote: >> > >>>>>>>>>>>>>>>>>>> Hi all, >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce >> > metric group >> > >>>>>>>>> for >> > >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the >> > >>>>>>>>> coordinator >> > >>>>>>>>>>> for >> > >>>>>>>>>>>> runtime >> > >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The >> > coordination >> > >>>>>>>>>>>> mechanism is >> > >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator >> > and its all >> > >>>>>>>>>>>> operators, the >> > >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink, >> for >> > >>>>>>>> example >> > >>>>>>>>>> many >> > >>>>>>>>>>>> Sources >> > >>>>>>>>>>>>>>>>> and >> > >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits >> and >> > >>>>>>>>>> coordinate >> > >>>>>>>>>>>> commits to >> > >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is >> > widely >> > >>>>>>>> using >> > >>>>>>>>> in >> > >>>>>>>>>>>> flink kafka >> > >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc >> > connector, >> > >>>>>>>>>> flink >> > >>>>>>>>>>>> hudi >> > >>>>>>>>>>>>>>>>>>> connector and so on. >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope >> > for the >> > >>>>>>>>>>>>>>>>> OperatorCoordinator >> > >>>>>>>>>>>>>>>>>>> and not an implementation for the interface >> > >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup. >> > >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be >> > how many >> > >>>>>>>>>>>> splits/partitions >> > >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many >> > files >> > >>>>>>>> have >> > >>>>>>>>>> been >> > >>>>>>>>>>>> written >> > >>>>>>>>>>>>>>>>> out >> > >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help >> > users to >> > >>>>>>>> know >> > >>>>>>>>>> the >> > >>>>>>>>>>>> job >> > >>>>>>>>>>>>>>>>> progress >> > >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus >> we >> > >>>>>>>> propose >> > >>>>>>>>>> the >> > >>>>>>>>>>>> FLIP-274 >> > >>>>>>>>>>>>>>>>> to >> > >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for >> > >>>>>>>> OperatorCoordinator >> > >>>>>>>>>> and >> > >>>>>>>>>>>> provide >> > >>>>>>>>>>>>>>>>> an >> > >>>>>>>>>>>>>>>>>>> internal implementation for >> > >>>>>>>>> OperatorCoordinatorMetricGroup. >> > >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get >> > time? Any >> > >>>>>>>>>>> feedback >> > >>>>>>>>>>>> is >> > >>>>>>>>>>>>>>>>>>> appreciated! >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> Best, >> > >>>>>>>>>>>>>>>>>>> Hang >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> [1] >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> >> > >> >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator >> > >> >> > >> >