Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Chesnay Schepler Thu, 19 Jan 2023 01:32:54 -0800

> First, I do not understand why users have to configure the new scopeformat, which has a default value.

If you don't use scope formats, sure. If you do use scope formats, e.g.to add a common prefix (which is the case for datadog users forexample), then the current default in the FLIP is insufficient andrequires the user to update the configuration.

> I usually do not need to change these configuration of scope formatswhen submitting the flink job.

Scope formats as a whole are quite a power-user feature, but thatdoesn't mean we should ignore it.


> I try to let it extend the ComponentMetricGroup

This isn't inherently required just because it is a "component".

Component metric groups should _only_ be used for cases where such acomponent provides several bits of metadata.For example, tasks provide vertex ids, task names, attempted IDs etc.,and we'd like users to have the option on how this metadata ends upbeing used (via scope formats). This currently can't be built with theaddGroup() methods, hence why the component groups exist.However, the operator coordinator _itself_does not provide several bitsof metadata. Logically, the _operator_ does, not the coordinator.


> This make me decide to add a new scope format.

And this is fine; just don't add a new scope format for the coordinatorbut logically for operators on the JM side, such that we can extend theset of operator-specific metrics in the future without having to add yetanother scope format.

> The relationship between the Operator and OperatorCoordinator ismaintained in the OperatorCoordinatorHolder. In the POC, the operatorname/id could be found and the OperatorCoordinatorMetricGroup will becreated here.


That's all irrelevant.

> The registered metrics of OperatorCoordinator always are those whichcan not be aggregated from the tasks, like the number of unassignedsplits in the SourceCoordinator. Actually we have not encountered thescenario you mentioned. I think the OperatorCoordinatorMetricGroupshould only contain the metrics for itself instead of its subtasks.


You are misunderstanding the problem.

This isn't about aggregating metrics on our side or exposing metricsfrom TMs on the JM side or anything like that.

It's purely about the available metadata for operator metrics on the JMside. Currently you suggest operator name / id; what I'm proposing is toadd task(==vertex!) id / name to this list.

The benefit here is simple. I can look up all metrics where task_id==XXXand get _everything_ related to that vertex, including all metricsassociated with operators that are part of that vertex, including thecoordinator.

> Using metrics.scope.jm.job is not enough to distinguish the differentcoordinators.

I did not suggest just using metrics.scope.jm.job. I suggested that asthe default for the jm operator scope format, that is used for theproposed JobManagerOperatorMetricGroup, which will have some plainmetric groups as children for coordinators.Aka, you have the JMOperatorMG, then you call addGroup("coordinators")and then addGroup(coordinator_name) for each coordinator.

> So there are two choice for theInternalOperatorCoordinatorMetricGroup: 1. add and improve the new scopeformat; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which oneis better?


There are more options than that.

I feel like there are a lot of misunderstandings in this discussion.Please let me know if I could clear things up. If not I can also providea PoC based on your PoC if that can speed things up. It's not thatdifferent anyway.


On 19/01/2023 07:39, Hang Ruan wrote:

Hi, chesnay,

Thanks for your reply. I still have some doubts about the questionsyou raised.


    > Extending the set of ScopeFormats is problematic because it in
    practice
    it breaks the config if users actively rely on it, since there's now
    another key that they _must_ set for it to be
    consistent/compatible with
    their existing setup.
    Unfortunately due to how powerful scope formats are we can't derive a
    default value that matches their existing setup.
    Hence we should try to do this as rarely as possible.


    > This FLIP does not adhere to that since it proposes a dedicated
    format
    for coordinators; next time we want to expose operator-specific
    metrics
    (e.g., in the scheduler) we'd have to add another one to support it.

First, I do not understand why users have to configure the new scopeformat, which has a default value. I usually do not need to changethese configuration of scope formats when submitting the flink job.IMO, the OperatorCoordinator is a component running at the JobMaster.I try to let it extend the ComponentMetricGroup, but I cannot finda suitable scope. This make me decide to add a new scope format.


    > Additionally, the configurable variables (operator name/id) are
    logically not attached to the coordinator, but operators, so to me it
    just doesn't make sense to structure it like this.

The relationship between the Operator and OperatorCoordinator ismaintained in the OperatorCoordinatorHolder. In the POC, the operatorname/id could be found and the OperatorCoordinatorMetricGroup will becreated here.


    > Another thing I'm concerned about is that, because we don't include
    tasks in the hierarchy, users wishing to collect all metrics for a
    particular task (in this case ==vertex) now have to go
    significantly out
    of their way to get them, since they can no longer just filter by the
    task ID but have to be filter for _all_ operators that are part of
    the task.

The registered metrics of OperatorCoordinator always are those whichcan not be aggregated from the tasks, like the number of unassignedsplits in the SourceCoordinator. Actually we have not encountered thescenario you mentioned. I think the OperatorCoordinatorMetricGroupshould only contain the metrics for itself instead of its subtasks.

Using metrics.scope.jm.job is not enough to distinguish the differentcoordinators. The operator id/name is necessary. Thenthe implementation should be like this. But users can not change thescope in this way. This is also acceptable.

@Internal public class InternalOperatorCoordinatorMetricGroup
         extends ProxyMetricGroup<MetricGroup>
         implements OperatorCoordinatorMetricGroup {
     public InternalOperatorCoordinatorMetricGroup(
             JobManagerJobMetricGroup parent, OperatorID operatorID, String 
operatorName) {
         super(parent.addGroup(operatorID + 
operatorName).addGroup("coordinator")); }
}

So there are two choice for theInternalOperatorCoordinatorMetricGroup: 1. add and improve the newscope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.Which one is better?


Best,
Hang


Chesnay Schepler <ches...@apache.org> 于2023年1月18日周三 17:03写道：

    You're misunderstanding the problem.

    Metric groups form a tree with each group providing certain metadata.

    E.g., on the taskmanager we have a TM metric group that provides info
    about the TM, that has child task metric groups, that have child
    operator metric groups etc. The operator metric group again has
    children, where we are now mostly in the user/connector land, like a
    kafka metric group providing partition info or something.

    What is being proposed is to create a coordinator metric group on
    the JM
    side that provides operator metadata.

    A more appropriate structure is to create an operator group on the JM
    side that provides this info, with the coordinator metric group
    being a
    child.

    On 17/01/2023 19:56, Steven Wu wrote:
    >> Additionally, the configurable variables (operator name/id) are
    > logically not attached to the coordinator, but operators, so to
    me it
    > just doesn't make sense to structure it like this.
    >
    > Chesnay, maybe we should clarify the terminology. To me,
    pperators (like
    > FLIP-27 source) can have two parts (coordinator and
    reader/subtask). I
    > think it is fine to include operator name/id for coordinator
    metrics.
    >
    > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
    <ches...@apache.org> wrote:
    >
    >> Slight correction: Using metrics.scope.jm.job as the default
    should be
    >> safe.
    >>
    >> On 16/01/2023 10:18, Chesnay Schepler wrote:
    >>> The proposed ScopeFormat is still problematic for a few reasons.
    >>>
    >>> Extending the set of ScopeFormats is problematic because it in
    >>> practice it breaks the config if users actively rely on it, since
    >>> there's now another key that they _must_ set for it to be
    >>> consistent/compatible with their existing setup.
    >>> Unfortunately due to how powerful scope formats are we can't
    derive a
    >>> default value that matches their existing setup.
    >>> Hence we should try to do this as rarely as possible.
    >>>
    >>> This FLIP does not adhere to that since it proposes a
    dedicated format
    >>> for coordinators; next time we want to expose operator-specific
    >>> metrics (e.g., in the scheduler) we'd have to add another one to
    >>> support it.
    >>>
    >>> Additionally, the configurable variables (operator name/id) are
    >>> logically not attached to the coordinator, but operators, so
    to me it
    >>> just doesn't make sense to structure it like this.
    >>>
    >>> Another thing I'm concerned about is that, because we don't
    include
    >>> tasks in the hierarchy, users wishing to collect all metrics for a
    >>> particular task (in this case ==vertex) now have to go
    significantly
    >>> out of their way to get them, since they can no longer just
    filter by
    >>> the task ID but have to be filter for _all_ operators that are
    part of
    >>> the task.
    >>>
    >>> On 16/01/2023 03:09, Hang Ruan wrote:
    >>>> Hi, @ches...@apache.org <ches...@apache.org> ,
    >>>>
    >>>> Do you have time to help to review this FLIP again after the
    >>>> modification?
    >>>> Looking forward to your reply.
    >>>> This FLIP will add a new configuration for the
    >>>> OperatorCoordinatorMetricGroup scope format. It provides an
    internal
    >>>> implementation and is added as a component to the
    >>>> JobManagerJobMetricGroup.
    >>>> If something doesn't make sense, could you provide some
    advice? It
    >>>> will be
    >>>> very helpful. Thanks a lot for your help.
    >>>>
    >>>> Best,
    >>>> Hang
    >>>>
    >>>> Martijn Visser <martijnvis...@apache.org> 于2023年1月11日周三
    16:34写道：
    >>>>
    >>>>
    >>>>> Hi Hang,
    >>>>>
    >>>>> I'm a bit surprised that this has gone to a vote, given that
    Chesnay
    >>>>> deliberately mentioned that he would vote against it as-is.
    I would
    >>>>> expect
    >>>>> that before going to a vote, he has had the opportunity to
    >>>>> participate in
    >>>>> this discussion.
    >>>>>
    >>>>> Best regards,
    >>>>>
    >>>>> Martijn
    >>>>>
    >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <imj...@gmail.com>
    wrote:
    >>>>>
    >>>>>> Hi Dong,
    >>>>>>
    >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
    concern is
    >>>>>> that
    >>>>>> this is a core interface for the FLIP. It’s hard to tell how
    >>>>>> sources use
    >>>>>> metric group without mentioning this interface. Even if
    this is an
    >>>>> existing
    >>>>>> API, I think it’s worth introducing the interface again and
    declaring
    >>>>> that
    >>>>>> we will implement the interface instead of a no-op method
    in this
    >>>>>> FLIP.
    >>>>>>
    >>>>>> Anyway, this is a minor problem and shouldn’t block this
    FLIP. I’m
    >>>>>> +1 to
    >>>>>> start a vote.
    >>>>>>
    >>>>>> Best,
    >>>>>> Jark
    >>>>>>
    >>>>>>
    >>>>>>> 2023年1月3日 10:03，Hang Ruan <ruanhang1...@gmail.com> 写道：
    >>>>>>>
    >>>>>>> Hi, Jark and Dong,
    >>>>>>>
    >>>>>>> Thanks for your comments. Sorry for my late reply.
    >>>>>>>
    >>>>>>> For suggestion 1, I plan to implement the
    >>>>>>> SplitEnumeratorMetricGroup in
    >>>>>>> another issue, and it is not contained in this FLIP. I
    will add some
    >>>>>>> description about this part.
    >>>>>>> For suggestion 2, changes about
    OperatorCoordinator#metricGroup has
    >>>>>> already
    >>>>>>> been documented in the proposed change section.
    >>>>>>>
    >>>>>>> Best,
    >>>>>>> Hang
    >>>>>>>
    >>>>>>> Dong Lin <lindon...@gmail.com> 于2023年1月1日周日 09:45写道：
    >>>>>>>
    >>>>>>>> Let me chime-in and add comments regarding the public
    interface
    >>>>> section.
    >>>>>>>> Please see my comments inline.
    >>>>>>>>
    >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
    <imj...@gmail.com> wrote:
    >>>>>>>>
    >>>>>>>>> Hi Hang,
    >>>>>>>>>
    >>>>>>>>> Thanks for driving this discussion. I think this is a
    very useful
    >>>>>> feature
    >>>>>>>>> for connectors.
    >>>>>>>>>
    >>>>>>>>> The FLIP looks quite good to me, and I just have two
    suggestions.
    >>>>>>>>>
    >>>>>>>>> 1. In the "Public Interface" section, mention that the
    >>>>>>>>> implementation
    >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
    changed from
    >>>>>>>> returning
    >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup
    instance.
    >>>>> Even
    >>>>>>>>> though the API is already there, the behavior change can
    also be
    >>>>>>>> considered
    >>>>>>>>> a public change.
    >>>>>>>>>
    >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
    this FLIP
    >>>>>>>> does
    >>>>>> not
    >>>>>>>> seem to change its semantics/behavior. The FLIP does
    change the
    >>>>>>>> implementation/behavior of
    SourceCoordinatorContext#metricGroup,
    >>>>>>>> which
    >>>>>> is
    >>>>>>>> marked @Internal.
    >>>>>>>>
    >>>>>>>> Thus it might seem a bit weird to add in the public interface
    >>>>>>>> section
    >>>>>>>> saying that we change the interface
    >>>>>>>> SplitEnumeratorContext#metricGroup
    >>>>>> from
    >>>>>>>> returning null to non-null object.
    >>>>>>>>
    >>>>>>>> 2. Mention the newly added interface of
    >>>>>> "OperatorCoordinator#metricGroup"
    >>>>>>>>> in the "Proposed Changes" section or "Public Interface"
    section. As
    >>>>> the
    >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
    connectors.
    >>>>>> Though
    >>>>>>>> it
    >>>>>>>>> is still an @Internal API, I think it is worth
    mentioning the
    >>>>>>>>> change
    >>>>> in
    >>>>>>>> the
    >>>>>>>>> FLIP.
    >>>>>>>> Since OperatorCoordinator is an internal API, it seems
    reasonable to
    >>>>>>>> explain it in the proposed change section. The FLIP seems
    to have
    >>>>>>>> documented this in point 5 of the proposed change section.
    >>>>>>>>
    >>>>>>>> BTW, if we think there are @internal classes that are
    important
    >>>>>>>> enough
    >>>>>> to
    >>>>>>>> be added in the public interface section, it might be
    useful to
    >>>>>> explicitly
    >>>>>>>> discuss this topic and document it in the "*What are the
    "public
    >>>>>>>> interfaces" of the project*" in this
    >>>>>>>> <
    >>>>>>>>
    >>
    
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
    >>>>>>>> wiki.
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>> Best,
    >>>>>>>>> Jark
    >>>>>>>>>
    >>>>>>>>>
    >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
    <ruanhang1...@gmail.com>
    >>>>>> wrote:
    >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
    >>>>>>>>>> After combining everyone's comments, the main concerns and
    >>>>>>>> corresponding
    >>>>>>>>>> adjustments are as follows.
    >>>>>>>>>>
    >>>>>>>>>> Q1: Common metrics are not quite useful.
    >>>>>>>>>> numEventsIn and numEventsOut counters will be removed
    from the
    >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do not
    >>>>>>>>>> provide
    >>>>>>>>> enough
    >>>>>>>>>> information for users. The users are more willing to
    get the
    >>>>>>>>>> number
    >>>>> of
    >>>>>>>>>> events of the specified type instead of the total
    number. And this
    >>>>>>>> metric
    >>>>>>>>>> is calculated differently. The implementation could
    register the
    >>>>>> metric
    >>>>>>>>> by
    >>>>>>>>>> themselves.
    >>>>>>>>>>
    >>>>>>>>>> Q2: This FLIP is overly complicated.
    >>>>>>>>>> This FLIP will become concise after these modifications.
    >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
    introduced into
    >>>>> Flink
    >>>>>>>> by
    >>>>>>>>>> FLIP-179<
    >>>>>>>>>>
    >>>>>>>>>>
    >>
    
https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
    >>>>>>>>>>> .
    >>>>>>>>>> And
    >>>>>>>>>> this FLIP will not change it. This FLIP only provides a
    new metric
    >>>>>>>> option
    >>>>>>>>>> and a new metric group scope. The changes in proposed
    changes
    >>>>> provide
    >>>>>>>> the
    >>>>>>>>>> details about the modifications for the internal
    classes, which
    >>>>> might
    >>>>>>>>> make
    >>>>>>>>>> it look complicated.
    >>>>>>>>>>
    >>>>>>>>>> Thanks for all the comments again. If there are no further
    >>>>>>>>>> comments,
    >>>>>> we
    >>>>>>>>>> plan to start the voting thread this week.
    >>>>>>>>>>
    >>>>>>>>>> Best,
    >>>>>>>>>> Hang
    >>>>>>>>>>
    >>>>>>>>>> Qingsheng Ren <renqs...@gmail.com> 于2022年12月26日周一
    16:48写道：
    >>>>>>>>>>
    >>>>>>>>>>
    >>>>>>>>>>> Thanks for the FLIP, Hang!
    >>>>>>>>>>>
    >>>>>>>>>>> This FLIP overall looks good to me. Actually I share
    the same
    >>>>> concern
    >>>>>>>>>> with
    >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not
    quite
    >>>>>>>>>>> useful
    >>>>>>>> to
    >>>>>>>>>> end
    >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction,
    which
    >>>>> requires
    >>>>>>>>>>> instantialization in order to be practical to users and
    >>>>>>>>>>> developers,
    >>>>>>>> so
    >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
    >>>>>>>>>>>
    >>>>>>>>>>> Best,
    >>>>>>>>>>> Qingsheng
    >>>>>>>>>>> Ververica (Alibaba)
    >>>>>>>>>>>
    >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
    <reed...@gmail.com>
    >>>>> wrote:
    >>>>>>>>>>>> Hi Hang,
    >>>>>>>>>>>> I still see no strong reason why we need
    >>>>>>>>>>>> numEventsIn/numEventsOut
    >>>>>>>>>>> metrics.
    >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same
    concern
    >>>>>>>>>>>> from
    >>>>>>>>>> others.
    >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
    >>>>>>>>>>>> over-extending
    >>>>>>>>> the
    >>>>>>>>>>>> scope.
    >>>>>>>>>>>>
    >>>>>>>>>>>> Thanks,
    >>>>>>>>>>>> Zhu
    >>>>>>>>>>>>
    >>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月23日周五
    >>>>>>>>>>>> 15:21写道：
    >>>>>>>>>>>>> Hi everyone,
    >>>>>>>>>>>>>
    >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep
    the common
    >>>>>>>>>>> metrics,
    >>>>>>>>>>>> which will help to observe incoming and outgoing
    events. What do
    >>>>>>>> you
    >>>>>>>>>>> think,
    >>>>>>>>>>>> @Zhu Zhu ?
    >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
    more concerned
    >>>>>>>>>> about?
    >>>>>>>>>>>> Looking forward to your reply.
    >>>>>>>>>>>>> Thanks for all the comments. If there are no further
    >>>>>>>>>>>>> comments, we
    >>>>>>>>>> plan
    >>>>>>>>>>>> to start the voting thread next week.
    >>>>>>>>>>>>> Best,
    >>>>>>>>>>>>> Hang
    >>>>>>>>>>>>>
    >>>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月15日周四
    >>>>>>>>>>>>> 16:49写道：
    >>>>>>>>>>>>>> Hi, Zhu Zhu,
    >>>>>>>>>>>>>>
    >>>>>>>>>>>>>> Thanks for your feedback!
    >>>>>>>>>>>>>>
    >>>>>>>>>>>>>> The OperatorCoordinator implementations are
    different. And
    >>>>>>>>>>>>>> their
    >>>>>>>>>>>> metrics are much different too. We try to find the common
    >>>>>>>>>>>> metrics
    >>>>>>>> and
    >>>>>>>>>> put
    >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
    developers
    >>>>>>>> think
    >>>>>>>>> we
    >>>>>>>>>>> do
    >>>>>>>>>>>> not need these common metrics, removing the common
    metrics is
    >>>>>>>>>>>> also
    >>>>>>>>>>>> acceptable.
    >>>>>>>>>>>>>> Best,
    >>>>>>>>>>>>>> Hang
    >>>>>>>>>>>>>>
    >>>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com> 于2022年12月14日周三
    22:09写道：
    >>>>>>>>>>>>>>> Hi Hang & MengYue,
    >>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>> Thanks for creating this FLIP!
    >>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
    >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics.
    >>>>>>>>>>>>>>> Currently
    >>>>>>>>>>>>>>> the coordinators has no way to do this. And operator
    >>>>>>>> coordinator
    >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
    to have
    >>>>>>>>>>>>>>> access
    >>>>>>>>>>>>>>> to a registered metric group (via the existing public
    >>>>>>>>>>>>>>> interface
    >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
    null at the
    >>>>>>>>> moment.
    >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
    metrics. A clear
    >>>>>>>>>>>> definition
    >>>>>>>>>>>>>>> makes it easy for users to find their wanted
    metrics. The
    >>>>>>>>>> definition
    >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from multiple
    >>>>>>>>>>>> OperatorCoordinators
    >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may
    have its
    >>>>>>>>>>>>>>> own
    >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
    be directly
    >>>>>>>>>>> registered
    >>>>>>>>>>>>>>> to the job metric group.
    >>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
    introduced
    >>>>>>>>> common
    >>>>>>>>>>>> metrics
    >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there
    any case
    >>>>>>>> which
    >>>>>>>>>>>> strongly
    >>>>>>>>>>>>>>> requires them?
    >>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
    >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is overkill
    >>>>>>>>>>>>>>> Directly using the job metric group can result in
    metric
    >>>>>>>>> conflicts,
    >>>>>>>>>>> as
    >>>>>>>>>>>> mentioned
    >>>>>>>>>>>>>>> in above #2.
    >>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>> Thanks,
    >>>>>>>>>>>>>>> Zhu
    >>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>> Dong Lin <lindon...@gmail.com> 于2022年12月10日周六
    >>>>>>>>>>>>>>> 14:16写道：
    >>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>> Hi Chesney,
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>> Just to double check with you,
    >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
    >>>>>>>>>>>> (annotated as
    >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into
    Flink by
    >>>>>>>>>> FLIP-179
    >>>>>>>>>>>>>>>> <
    >>
    
https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
    >>>>>>>>>>>>> .
    >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
    should remove
    >>>>>>>>> this
    >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>> Regards,
    >>>>>>>>>>>>>>>> Dong
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
    >>>>>>>>>>> ches...@apache.org>
    >>>>>>>>>>>> wrote:
    >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
    complicated. A
    >>>>>>>>>>> dedicated
    >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
    could just
    >>>>>>>>> re-use
    >>>>>>>>>>> the
    >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
    structure we
    >>>>>>>>> have
    >>>>>>>>>> on
    >>>>>>>>>>>> TMs,
    >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
    >>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required
    anyway,
    >>>>>>>>>> because
    >>>>>>>>>>>> all the
    >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on
    the TM
    >>>>>>>> side
    >>>>>>>>> +
    >>>>>>>>>>>>>>>>> aggregating them in the external metrics backend.
    >>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries
    to pull a
    >>>>>>>>> fast
    >>>>>>>>>>>> one on
    >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
    against it.
    >>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
    >>>>>>>>>>>>>>>>>> Hi Hang,
    >>>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it
    is pretty
    >>>>>>>>>>>> informative.
    >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
    >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
    >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
    >>>>>>>> consistency
    >>>>>>>>>>> with
    >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
    >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
    >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
    >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
    >>>>>>>>>>>> for
    >>>>>>>>>>>>>>>>>> simplicity and consistency with
    SCOPE_NAMING_OPERATOR
    >>>>>>>>> (which
    >>>>>>>>>> is
    >>>>>>>>>>>> not named
    >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
    >>>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>>> Cheers,
    >>>>>>>>>>>>>>>>>> Dong
    >>>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
    >>>>>>>>>>> ruanhang1...@gmail.com>
    >>>>>>>>>>>> wrote:
    >>>>>>>>>>>>>>>>>>> Hi all,
    >>>>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
    metric group
    >>>>>>>>> for
    >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
    >>>>>>>>> coordinator
    >>>>>>>>>>> for
    >>>>>>>>>>>> runtime
    >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
    coordination
    >>>>>>>>>>>> mechanism is
    >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
    and its all
    >>>>>>>>>>>> operators, the
    >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for
    >>>>>>>> example
    >>>>>>>>>> many
    >>>>>>>>>>>> Sources
    >>>>>>>>>>>>>>>>> and
    >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits and
    >>>>>>>>>> coordinate
    >>>>>>>>>>>> commits to
    >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
    widely
    >>>>>>>> using
    >>>>>>>>> in
    >>>>>>>>>>>> flink kafka
    >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
    connector,
    >>>>>>>>>> flink
    >>>>>>>>>>>> hudi
    >>>>>>>>>>>>>>>>>>> connector and so on.
    >>>>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope
    for the
    >>>>>>>>>>>>>>>>> OperatorCoordinator
    >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
    >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
    >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be
    how many
    >>>>>>>>>>>> splits/partitions
    >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many
    files
    >>>>>>>> have
    >>>>>>>>>> been
    >>>>>>>>>>>> written
    >>>>>>>>>>>>>>>>> out
    >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
    users to
    >>>>>>>> know
    >>>>>>>>>> the
    >>>>>>>>>>>> job
    >>>>>>>>>>>>>>>>> progress
    >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
    >>>>>>>> propose
    >>>>>>>>>> the
    >>>>>>>>>>>> FLIP-274
    >>>>>>>>>>>>>>>>> to
    >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
    >>>>>>>> OperatorCoordinator
    >>>>>>>>>> and
    >>>>>>>>>>>> provide
    >>>>>>>>>>>>>>>>> an
    >>>>>>>>>>>>>>>>>>> internal implementation for
    >>>>>>>>> OperatorCoordinatorMetricGroup.
    >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
    time? Any
    >>>>>>>>>>> feedback
    >>>>>>>>>>>> is
    >>>>>>>>>>>>>>>>>>> appreciated!
    >>>>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>>>> Best,
    >>>>>>>>>>>>>>>>>>> Hang
    >>>>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>>>> [1]
    >>>>>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>>>>
    >>
    
https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
    >>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Reply via email to