Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Jark Wu Sat, 28 Jan 2023 00:16:11 -0800

Hi all,

IIUC, Chesnay means we should have a more general metric group for
**operator**
not for **coordinator** in JM, this would be useful to extend
other operator-specific
metrics in the future. That means the new scope format should be designed
for
the operator,
e.g., metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
The coordinator metric is a subgroup (a constant "coordinator" suffix) of
the JMOperatorMG.


I think this is a nice design. However, I have a question about adding task
id/name to this list.
How to get the task id/name when reporting a coordinator metric? As we
know, coordinator metrics
are just like JMJobMetricGroup, which don't belong to any tasks/vertexes.
Do you mean reporting
coordinator metrics for every task id under the operator?

Best,
Jark


On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ches...@apache.org> wrote:

>  > First, I do not understand why users have to configure the new scope
> format, which has a default value.
>
> If you don't use scope formats, sure. If you do use scope formats, e.g.
> to add a common prefix (which is the case for datadog users for
> example), then the current default in the FLIP is insufficient and
> requires the user to update the configuration.
>
>  > I usually do not need to change these configuration of scope formats
> when submitting the flink job.
>
> Scope formats as a whole are quite a power-user feature, but that
> doesn't mean we should ignore it.
>
>  > I try to let it extend the ComponentMetricGroup
>
> This isn't inherently required just because it is a "component".
> Component metric groups should _only_ be used for cases where such a
> component provides several bits of metadata.
> For example, tasks provide vertex ids, task names, attempted IDs etc.,
> and we'd like users to have the option on how this metadata ends up
> being used (via scope formats). This currently can't be built with the
> addGroup() methods, hence why the component groups exist.
> However, the operator coordinator _itself_does not provide several bits
> of metadata. Logically, the _operator_ does, not the coordinator.
>
>  > This make me decide to add a new scope format.
>
> And this is fine; just don't add a new scope format for the coordinator
> but logically for operators on the JM side, such that we can extend the
> set of operator-specific metrics in the future without having to add yet
> another scope format.
>
>  > The relationship between the Operator and OperatorCoordinator is
> maintained in the OperatorCoordinatorHolder. In the POC, the operator
> name/id could be found and the OperatorCoordinatorMetricGroup will be
> created here.
>
> That's all irrelevant.
>
>  > The registered metrics of OperatorCoordinator always are those which
> can not be aggregated from the tasks, like the number of unassigned
> splits in the SourceCoordinator. Actually we have not encountered the
> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
> should only contain the metrics for itself instead of its subtasks.
>
> You are misunderstanding the problem.
>
> This isn't about aggregating metrics on our side or exposing metrics
> from TMs on the JM side or anything like that.
>
> It's purely about the available metadata for operator metrics on the JM
> side. Currently you suggest operator name / id; what I'm proposing is to
> add task(==vertex!) id / name to this list.
>
> The benefit here is simple. I can look up all metrics where task_id==XXX
> and get _everything_ related to that vertex, including all metrics
> associated with operators that are part of that vertex, including the
> coordinator.
>
>  > Using metrics.scope.jm.job is not enough to distinguish the different
> coordinators.
>
> I did not suggest just using metrics.scope.jm.job. I suggested that as
> the default for the jm operator scope format, that is used for the
> proposed JobManagerOperatorMetricGroup, which will have some plain
> metric groups as children for coordinators.
> Aka, you have the JMOperatorMG, then you call addGroup("coordinators")
> and then addGroup(coordinator_name) for each coordinator.
>
>  > So there are two choice for the
> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new scope
> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one
> is better?
>
> There are more options than that.
> I feel like there are a lot of misunderstandings in this discussion.
> Please let me know if I could clear things up. If not I can also provide
> a PoC based on your PoC if that can speed things up. It's not that
> different anyway.
>
> On 19/01/2023 07:39, Hang Ruan wrote:
> > Hi, chesnay,
> >
> > Thanks for your reply. I still have some doubts about the questions
> > you raised.
> >
> >     > Extending the set of ScopeFormats is problematic because it in
> >     practice
> >     it breaks the config if users actively rely on it, since there's now
> >     another key that they _must_ set for it to be
> >     consistent/compatible with
> >     their existing setup.
> >     Unfortunately due to how powerful scope formats are we can't derive a
> >     default value that matches their existing setup.
> >     Hence we should try to do this as rarely as possible.
> >
> >
> >     > This FLIP does not adhere to that since it proposes a dedicated
> >     format
> >     for coordinators; next time we want to expose operator-specific
> >     metrics
> >     (e.g., in the scheduler) we'd have to add another one to support it.
> >
> >
> > First, I do not understand why users have to configure the new scope
> > format, which has a default value. I usually do not need to change
> > these configuration of scope formats when submitting the flink job.
> > IMO, the OperatorCoordinator is a component running at the JobMaster.
> > I try to let it extend the ComponentMetricGroup, but I cannot find
> > a suitable scope. This make me decide to add a new scope format.
> >
> >     > Additionally, the configurable variables (operator name/id) are
> >     logically not attached to the coordinator, but operators, so to me it
> >     just doesn't make sense to structure it like this.
> >
> >
> > The relationship between the Operator and OperatorCoordinator is
> > maintained in the OperatorCoordinatorHolder. In the POC, the operator
> > name/id could be found and the OperatorCoordinatorMetricGroup will be
> > created here.
> >
> >     > Another thing I'm concerned about is that, because we don't include
> >     tasks in the hierarchy, users wishing to collect all metrics for a
> >     particular task (in this case ==vertex) now have to go
> >     significantly out
> >     of their way to get them, since they can no longer just filter by the
> >     task ID but have to be filter for _all_ operators that are part of
> >     the task.
> >
> >
> > The registered metrics of OperatorCoordinator always are those which
> > can not be aggregated from the tasks, like the number of unassigned
> > splits in the SourceCoordinator. Actually we have not encountered the
> > scenario you mentioned. I think the OperatorCoordinatorMetricGroup
> > should only contain the metrics for itself instead of its subtasks.
> >
> > Using metrics.scope.jm.job is not enough to distinguish the different
> > coordinators. The operator id/name is necessary. Then
> > the implementation should be like this. But users can not change the
> > scope in this way. This is also acceptable.
> > @Internal public class InternalOperatorCoordinatorMetricGroup
> >          extends ProxyMetricGroup<MetricGroup>
> >          implements OperatorCoordinatorMetricGroup {
> >      public InternalOperatorCoordinatorMetricGroup(
> >              JobManagerJobMetricGroup parent, OperatorID operatorID,
> String operatorName) {
> >          super(parent.addGroup(operatorID +
> operatorName).addGroup("coordinator")); }
> > }
> > So there are two choice for the
> > InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
> > scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
> > Which one is better?
> >
> > Best,
> > Hang
> >
> >
> > Chesnay Schepler <ches...@apache.org> 于2023年1月18日周三 17:03写道：
> >
> >     You're misunderstanding the problem.
> >
> >     Metric groups form a tree with each group providing certain metadata.
> >
> >     E.g., on the taskmanager we have a TM metric group that provides info
> >     about the TM, that has child task metric groups, that have child
> >     operator metric groups etc. The operator metric group again has
> >     children, where we are now mostly in the user/connector land, like a
> >     kafka metric group providing partition info or something.
> >
> >     What is being proposed is to create a coordinator metric group on
> >     the JM
> >     side that provides operator metadata.
> >
> >     A more appropriate structure is to create an operator group on the JM
> >     side that provides this info, with the coordinator metric group
> >     being a
> >     child.
> >
> >     On 17/01/2023 19:56, Steven Wu wrote:
> >     >> Additionally, the configurable variables (operator name/id) are
> >     > logically not attached to the coordinator, but operators, so to
> >     me it
> >     > just doesn't make sense to structure it like this.
> >     >
> >     > Chesnay, maybe we should clarify the terminology. To me,
> >     pperators (like
> >     > FLIP-27 source) can have two parts (coordinator and
> >     reader/subtask). I
> >     > think it is fine to include operator name/id for coordinator
> >     metrics.
> >     >
> >     > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
> >     <ches...@apache.org> wrote:
> >     >
> >     >> Slight correction: Using metrics.scope.jm.job as the default
> >     should be
> >     >> safe.
> >     >>
> >     >> On 16/01/2023 10:18, Chesnay Schepler wrote:
> >     >>> The proposed ScopeFormat is still problematic for a few reasons.
> >     >>>
> >     >>> Extending the set of ScopeFormats is problematic because it in
> >     >>> practice it breaks the config if users actively rely on it, since
> >     >>> there's now another key that they _must_ set for it to be
> >     >>> consistent/compatible with their existing setup.
> >     >>> Unfortunately due to how powerful scope formats are we can't
> >     derive a
> >     >>> default value that matches their existing setup.
> >     >>> Hence we should try to do this as rarely as possible.
> >     >>>
> >     >>> This FLIP does not adhere to that since it proposes a
> >     dedicated format
> >     >>> for coordinators; next time we want to expose operator-specific
> >     >>> metrics (e.g., in the scheduler) we'd have to add another one to
> >     >>> support it.
> >     >>>
> >     >>> Additionally, the configurable variables (operator name/id) are
> >     >>> logically not attached to the coordinator, but operators, so
> >     to me it
> >     >>> just doesn't make sense to structure it like this.
> >     >>>
> >     >>> Another thing I'm concerned about is that, because we don't
> >     include
> >     >>> tasks in the hierarchy, users wishing to collect all metrics for
> a
> >     >>> particular task (in this case ==vertex) now have to go
> >     significantly
> >     >>> out of their way to get them, since they can no longer just
> >     filter by
> >     >>> the task ID but have to be filter for _all_ operators that are
> >     part of
> >     >>> the task.
> >     >>>
> >     >>> On 16/01/2023 03:09, Hang Ruan wrote:
> >     >>>> Hi, @ches...@apache.org <ches...@apache.org> ,
> >     >>>>
> >     >>>> Do you have time to help to review this FLIP again after the
> >     >>>> modification?
> >     >>>> Looking forward to your reply.
> >     >>>> This FLIP will add a new configuration for the
> >     >>>> OperatorCoordinatorMetricGroup scope format. It provides an
> >     internal
> >     >>>> implementation and is added as a component to the
> >     >>>> JobManagerJobMetricGroup.
> >     >>>> If something doesn't make sense, could you provide some
> >     advice? It
> >     >>>> will be
> >     >>>> very helpful. Thanks a lot for your help.
> >     >>>>
> >     >>>> Best,
> >     >>>> Hang
> >     >>>>
> >     >>>> Martijn Visser <martijnvis...@apache.org> 于2023年1月11日周三
> >     16:34写道：
> >     >>>>
> >     >>>>
> >     >>>>> Hi Hang,
> >     >>>>>
> >     >>>>> I'm a bit surprised that this has gone to a vote, given that
> >     Chesnay
> >     >>>>> deliberately mentioned that he would vote against it as-is.
> >     I would
> >     >>>>> expect
> >     >>>>> that before going to a vote, he has had the opportunity to
> >     >>>>> participate in
> >     >>>>> this discussion.
> >     >>>>>
> >     >>>>> Best regards,
> >     >>>>>
> >     >>>>> Martijn
> >     >>>>>
> >     >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <imj...@gmail.com>
> >     wrote:
> >     >>>>>
> >     >>>>>> Hi Dong,
> >     >>>>>>
> >     >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
> >     concern is
> >     >>>>>> that
> >     >>>>>> this is a core interface for the FLIP. It’s hard to tell how
> >     >>>>>> sources use
> >     >>>>>> metric group without mentioning this interface. Even if
> >     this is an
> >     >>>>> existing
> >     >>>>>> API, I think it’s worth introducing the interface again and
> >     declaring
> >     >>>>> that
> >     >>>>>> we will implement the interface instead of a no-op method
> >     in this
> >     >>>>>> FLIP.
> >     >>>>>>
> >     >>>>>> Anyway, this is a minor problem and shouldn’t block this
> >     FLIP. I’m
> >     >>>>>> +1 to
> >     >>>>>> start a vote.
> >     >>>>>>
> >     >>>>>> Best,
> >     >>>>>> Jark
> >     >>>>>>
> >     >>>>>>
> >     >>>>>>> 2023年1月3日 10:03，Hang Ruan <ruanhang1...@gmail.com> 写道：
> >     >>>>>>>
> >     >>>>>>> Hi, Jark and Dong,
> >     >>>>>>>
> >     >>>>>>> Thanks for your comments. Sorry for my late reply.
> >     >>>>>>>
> >     >>>>>>> For suggestion 1, I plan to implement the
> >     >>>>>>> SplitEnumeratorMetricGroup in
> >     >>>>>>> another issue, and it is not contained in this FLIP. I
> >     will add some
> >     >>>>>>> description about this part.
> >     >>>>>>> For suggestion 2, changes about
> >     OperatorCoordinator#metricGroup has
> >     >>>>>> already
> >     >>>>>>> been documented in the proposed change section.
> >     >>>>>>>
> >     >>>>>>> Best,
> >     >>>>>>> Hang
> >     >>>>>>>
> >     >>>>>>> Dong Lin <lindon...@gmail.com> 于2023年1月1日周日 09:45写道：
> >     >>>>>>>
> >     >>>>>>>> Let me chime-in and add comments regarding the public
> >     interface
> >     >>>>> section.
> >     >>>>>>>> Please see my comments inline.
> >     >>>>>>>>
> >     >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
> >     <imj...@gmail.com> wrote:
> >     >>>>>>>>
> >     >>>>>>>>> Hi Hang,
> >     >>>>>>>>>
> >     >>>>>>>>> Thanks for driving this discussion. I think this is a
> >     very useful
> >     >>>>>> feature
> >     >>>>>>>>> for connectors.
> >     >>>>>>>>>
> >     >>>>>>>>> The FLIP looks quite good to me, and I just have two
> >     suggestions.
> >     >>>>>>>>>
> >     >>>>>>>>> 1. In the "Public Interface" section, mention that the
> >     >>>>>>>>> implementation
> >     >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
> >     changed from
> >     >>>>>>>> returning
> >     >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup
> >     instance.
> >     >>>>> Even
> >     >>>>>>>>> though the API is already there, the behavior change can
> >     also be
> >     >>>>>>>> considered
> >     >>>>>>>>> a public change.
> >     >>>>>>>>>
> >     >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
> >     this FLIP
> >     >>>>>>>> does
> >     >>>>>> not
> >     >>>>>>>> seem to change its semantics/behavior. The FLIP does
> >     change the
> >     >>>>>>>> implementation/behavior of
> >     SourceCoordinatorContext#metricGroup,
> >     >>>>>>>> which
> >     >>>>>> is
> >     >>>>>>>> marked @Internal.
> >     >>>>>>>>
> >     >>>>>>>> Thus it might seem a bit weird to add in the public
> interface
> >     >>>>>>>> section
> >     >>>>>>>> saying that we change the interface
> >     >>>>>>>> SplitEnumeratorContext#metricGroup
> >     >>>>>> from
> >     >>>>>>>> returning null to non-null object.
> >     >>>>>>>>
> >     >>>>>>>> 2. Mention the newly added interface of
> >     >>>>>> "OperatorCoordinator#metricGroup"
> >     >>>>>>>>> in the "Proposed Changes" section or "Public Interface"
> >     section. As
> >     >>>>> the
> >     >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
> >     connectors.
> >     >>>>>> Though
> >     >>>>>>>> it
> >     >>>>>>>>> is still an @Internal API, I think it is worth
> >     mentioning the
> >     >>>>>>>>> change
> >     >>>>> in
> >     >>>>>>>> the
> >     >>>>>>>>> FLIP.
> >     >>>>>>>> Since OperatorCoordinator is an internal API, it seems
> >     reasonable to
> >     >>>>>>>> explain it in the proposed change section. The FLIP seems
> >     to have
> >     >>>>>>>> documented this in point 5 of the proposed change section.
> >     >>>>>>>>
> >     >>>>>>>> BTW, if we think there are @internal classes that are
> >     important
> >     >>>>>>>> enough
> >     >>>>>> to
> >     >>>>>>>> be added in the public interface section, it might be
> >     useful to
> >     >>>>>> explicitly
> >     >>>>>>>> discuss this topic and document it in the "*What are the
> >     "public
> >     >>>>>>>> interfaces" of the project*" in this
> >     >>>>>>>> <
> >     >>>>>>>>
> >     >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >     >>>>>>>> wiki.
> >     >>>>>>>>
> >     >>>>>>>>
> >     >>>>>>>>> Best,
> >     >>>>>>>>> Jark
> >     >>>>>>>>>
> >     >>>>>>>>>
> >     >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
> >     <ruanhang1...@gmail.com>
> >     >>>>>> wrote:
> >     >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
> >     >>>>>>>>>> After combining everyone's comments, the main concerns and
> >     >>>>>>>> corresponding
> >     >>>>>>>>>> adjustments are as follows.
> >     >>>>>>>>>>
> >     >>>>>>>>>> Q1: Common metrics are not quite useful.
> >     >>>>>>>>>> numEventsIn and numEventsOut counters will be removed
> >     from the
> >     >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do
> not
> >     >>>>>>>>>> provide
> >     >>>>>>>>> enough
> >     >>>>>>>>>> information for users. The users are more willing to
> >     get the
> >     >>>>>>>>>> number
> >     >>>>> of
> >     >>>>>>>>>> events of the specified type instead of the total
> >     number. And this
> >     >>>>>>>> metric
> >     >>>>>>>>>> is calculated differently. The implementation could
> >     register the
> >     >>>>>> metric
> >     >>>>>>>>> by
> >     >>>>>>>>>> themselves.
> >     >>>>>>>>>>
> >     >>>>>>>>>> Q2: This FLIP is overly complicated.
> >     >>>>>>>>>> This FLIP will become concise after these modifications.
> >     >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
> >     introduced into
> >     >>>>> Flink
> >     >>>>>>>> by
> >     >>>>>>>>>> FLIP-179<
> >     >>>>>>>>>>
> >     >>>>>>>>>>
> >     >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >     >>>>>>>>>>> .
> >     >>>>>>>>>> And
> >     >>>>>>>>>> this FLIP will not change it. This FLIP only provides a
> >     new metric
> >     >>>>>>>> option
> >     >>>>>>>>>> and a new metric group scope. The changes in proposed
> >     changes
> >     >>>>> provide
> >     >>>>>>>> the
> >     >>>>>>>>>> details about the modifications for the internal
> >     classes, which
> >     >>>>> might
> >     >>>>>>>>> make
> >     >>>>>>>>>> it look complicated.
> >     >>>>>>>>>>
> >     >>>>>>>>>> Thanks for all the comments again. If there are no further
> >     >>>>>>>>>> comments,
> >     >>>>>> we
> >     >>>>>>>>>> plan to start the voting thread this week.
> >     >>>>>>>>>>
> >     >>>>>>>>>> Best,
> >     >>>>>>>>>> Hang
> >     >>>>>>>>>>
> >     >>>>>>>>>> Qingsheng Ren <renqs...@gmail.com> 于2022年12月26日周一
> >     16:48写道：
> >     >>>>>>>>>>
> >     >>>>>>>>>>
> >     >>>>>>>>>>> Thanks for the FLIP, Hang!
> >     >>>>>>>>>>>
> >     >>>>>>>>>>> This FLIP overall looks good to me. Actually I share
> >     the same
> >     >>>>> concern
> >     >>>>>>>>>> with
> >     >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not
> >     quite
> >     >>>>>>>>>>> useful
> >     >>>>>>>> to
> >     >>>>>>>>>> end
> >     >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction,
> >     which
> >     >>>>> requires
> >     >>>>>>>>>>> instantialization in order to be practical to users and
> >     >>>>>>>>>>> developers,
> >     >>>>>>>> so
> >     >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
> >     >>>>>>>>>>>
> >     >>>>>>>>>>> Best,
> >     >>>>>>>>>>> Qingsheng
> >     >>>>>>>>>>> Ververica (Alibaba)
> >     >>>>>>>>>>>
> >     >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
> >     <reed...@gmail.com>
> >     >>>>> wrote:
> >     >>>>>>>>>>>> Hi Hang,
> >     >>>>>>>>>>>> I still see no strong reason why we need
> >     >>>>>>>>>>>> numEventsIn/numEventsOut
> >     >>>>>>>>>>> metrics.
> >     >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same
> >     concern
> >     >>>>>>>>>>>> from
> >     >>>>>>>>>> others.
> >     >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
> >     >>>>>>>>>>>> over-extending
> >     >>>>>>>>> the
> >     >>>>>>>>>>>> scope.
> >     >>>>>>>>>>>>
> >     >>>>>>>>>>>> Thanks,
> >     >>>>>>>>>>>> Zhu
> >     >>>>>>>>>>>>
> >     >>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月23日周五
> >     >>>>>>>>>>>> 15:21写道：
> >     >>>>>>>>>>>>> Hi everyone,
> >     >>>>>>>>>>>>>
> >     >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep
> >     the common
> >     >>>>>>>>>>> metrics,
> >     >>>>>>>>>>>> which will help to observe incoming and outgoing
> >     events. What do
> >     >>>>>>>> you
> >     >>>>>>>>>>> think,
> >     >>>>>>>>>>>> @Zhu Zhu ?
> >     >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
> >     more concerned
> >     >>>>>>>>>> about?
> >     >>>>>>>>>>>> Looking forward to your reply.
> >     >>>>>>>>>>>>> Thanks for all the comments. If there are no further
> >     >>>>>>>>>>>>> comments, we
> >     >>>>>>>>>> plan
> >     >>>>>>>>>>>> to start the voting thread next week.
> >     >>>>>>>>>>>>> Best,
> >     >>>>>>>>>>>>> Hang
> >     >>>>>>>>>>>>>
> >     >>>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月15日周四
> >     >>>>>>>>>>>>> 16:49写道：
> >     >>>>>>>>>>>>>> Hi, Zhu Zhu,
> >     >>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>> Thanks for your feedback!
> >     >>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>> The OperatorCoordinator implementations are
> >     different. And
> >     >>>>>>>>>>>>>> their
> >     >>>>>>>>>>>> metrics are much different too. We try to find the
> common
> >     >>>>>>>>>>>> metrics
> >     >>>>>>>> and
> >     >>>>>>>>>> put
> >     >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
> >     developers
> >     >>>>>>>> think
> >     >>>>>>>>> we
> >     >>>>>>>>>>> do
> >     >>>>>>>>>>>> not need these common metrics, removing the common
> >     metrics is
> >     >>>>>>>>>>>> also
> >     >>>>>>>>>>>> acceptable.
> >     >>>>>>>>>>>>>> Best,
> >     >>>>>>>>>>>>>> Hang
> >     >>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com> 于2022年12月14日周三
> >     22:09写道：
> >     >>>>>>>>>>>>>>> Hi Hang & MengYue,
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> Thanks for creating this FLIP!
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
> >     >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics.
> >     >>>>>>>>>>>>>>> Currently
> >     >>>>>>>>>>>>>>> the coordinators has no way to do this. And operator
> >     >>>>>>>> coordinator
> >     >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
> >     to have
> >     >>>>>>>>>>>>>>> access
> >     >>>>>>>>>>>>>>> to a registered metric group (via the existing public
> >     >>>>>>>>>>>>>>> interface
> >     >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
> >     null at the
> >     >>>>>>>>> moment.
> >     >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
> >     metrics. A clear
> >     >>>>>>>>>>>> definition
> >     >>>>>>>>>>>>>>> makes it easy for users to find their wanted
> >     metrics. The
> >     >>>>>>>>>> definition
> >     >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
> multiple
> >     >>>>>>>>>>>> OperatorCoordinators
> >     >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may
> >     have its
> >     >>>>>>>>>>>>>>> own
> >     >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
> >     be directly
> >     >>>>>>>>>>> registered
> >     >>>>>>>>>>>>>>> to the job metric group.
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
> >     introduced
> >     >>>>>>>>> common
> >     >>>>>>>>>>>> metrics
> >     >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there
> >     any case
> >     >>>>>>>> which
> >     >>>>>>>>>>>> strongly
> >     >>>>>>>>>>>>>>> requires them?
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
> >     >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
> overkill
> >     >>>>>>>>>>>>>>> Directly using the job metric group can result in
> >     metric
> >     >>>>>>>>> conflicts,
> >     >>>>>>>>>>> as
> >     >>>>>>>>>>>> mentioned
> >     >>>>>>>>>>>>>>> in above #2.
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> Thanks,
> >     >>>>>>>>>>>>>>> Zhu
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> Dong Lin <lindon...@gmail.com> 于2022年12月10日周六
> >     >>>>>>>>>>>>>>> 14:16写道：
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>> Hi Chesney,
> >     >>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>> Just to double check with you,
> >     >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
> >     >>>>>>>>>>>> (annotated as
> >     >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into
> >     Flink by
> >     >>>>>>>>>> FLIP-179
> >     >>>>>>>>>>>>>>>> <
> >     >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >     >>>>>>>>>>>>> .
> >     >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
> >     should remove
> >     >>>>>>>>> this
> >     >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
> >     >>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>> Regards,
> >     >>>>>>>>>>>>>>>> Dong
> >     >>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
> >     >>>>>>>>>>> ches...@apache.org>
> >     >>>>>>>>>>>> wrote:
> >     >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
> >     complicated. A
> >     >>>>>>>>>>> dedicated
> >     >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
> >     could just
> >     >>>>>>>>> re-use
> >     >>>>>>>>>>> the
> >     >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
> >     structure we
> >     >>>>>>>>> have
> >     >>>>>>>>>> on
> >     >>>>>>>>>>>> TMs,
> >     >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
> >     >>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required
> >     anyway,
> >     >>>>>>>>>> because
> >     >>>>>>>>>>>> all the
> >     >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on
> >     the TM
> >     >>>>>>>> side
> >     >>>>>>>>> +
> >     >>>>>>>>>>>>>>>>> aggregating them in the external metrics backend.
> >     >>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries
> >     to pull a
> >     >>>>>>>>> fast
> >     >>>>>>>>>>>> one on
> >     >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
> >     against it.
> >     >>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
> >     >>>>>>>>>>>>>>>>>> Hi Hang,
> >     >>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it
> >     is pretty
> >     >>>>>>>>>>>> informative.
> >     >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
> >     >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
> >     >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
> >     >>>>>>>> consistency
> >     >>>>>>>>>>> with
> >     >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
> >     >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
> >     >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
> >     >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
> >     >>>>>>>>>>>> for
> >     >>>>>>>>>>>>>>>>>> simplicity and consistency with
> >     SCOPE_NAMING_OPERATOR
> >     >>>>>>>>> (which
> >     >>>>>>>>>> is
> >     >>>>>>>>>>>> not named
> >     >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >     >>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>> Cheers,
> >     >>>>>>>>>>>>>>>>>> Dong
> >     >>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> >     >>>>>>>>>>> ruanhang1...@gmail.com>
> >     >>>>>>>>>>>> wrote:
> >     >>>>>>>>>>>>>>>>>>> Hi all,
> >     >>>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
> >     metric group
> >     >>>>>>>>> for
> >     >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
> >     >>>>>>>>> coordinator
> >     >>>>>>>>>>> for
> >     >>>>>>>>>>>> runtime
> >     >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
> >     coordination
> >     >>>>>>>>>>>> mechanism is
> >     >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
> >     and its all
> >     >>>>>>>>>>>> operators, the
> >     >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for
> >     >>>>>>>> example
> >     >>>>>>>>>> many
> >     >>>>>>>>>>>> Sources
> >     >>>>>>>>>>>>>>>>> and
> >     >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits
> and
> >     >>>>>>>>>> coordinate
> >     >>>>>>>>>>>> commits to
> >     >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
> >     widely
> >     >>>>>>>> using
> >     >>>>>>>>> in
> >     >>>>>>>>>>>> flink kafka
> >     >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
> >     connector,
> >     >>>>>>>>>> flink
> >     >>>>>>>>>>>> hudi
> >     >>>>>>>>>>>>>>>>>>> connector and so on.
> >     >>>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope
> >     for the
> >     >>>>>>>>>>>>>>>>> OperatorCoordinator
> >     >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
> >     >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
> >     >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be
> >     how many
> >     >>>>>>>>>>>> splits/partitions
> >     >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many
> >     files
> >     >>>>>>>> have
> >     >>>>>>>>>> been
> >     >>>>>>>>>>>> written
> >     >>>>>>>>>>>>>>>>> out
> >     >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
> >     users to
> >     >>>>>>>> know
> >     >>>>>>>>>> the
> >     >>>>>>>>>>>> job
> >     >>>>>>>>>>>>>>>>> progress
> >     >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
> >     >>>>>>>> propose
> >     >>>>>>>>>> the
> >     >>>>>>>>>>>> FLIP-274
> >     >>>>>>>>>>>>>>>>> to
> >     >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
> >     >>>>>>>> OperatorCoordinator
> >     >>>>>>>>>> and
> >     >>>>>>>>>>>> provide
> >     >>>>>>>>>>>>>>>>> an
> >     >>>>>>>>>>>>>>>>>>> internal implementation for
> >     >>>>>>>>> OperatorCoordinatorMetricGroup.
> >     >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
> >     time? Any
> >     >>>>>>>>>>> feedback
> >     >>>>>>>>>>>> is
> >     >>>>>>>>>>>>>>>>>>> appreciated!
> >     >>>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>> Best,
> >     >>>>>>>>>>>>>>>>>>> Hang
> >     >>>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>> [1]
> >     >>>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>>
> >     >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >     >>
> >
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Reply via email to