Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Chesnay Schepler Tue, 31 Jan 2023 04:09:20 -0800

> IIUC, there is only an OperatorCoordinatorMetricGroup interfacewithout implementations (it never worked). There were no coordinatormetrics implemented and registered. @Hang Ruan<mailto:ruanhang1...@gmail.com> please correct me if I'm wrong.

Indeed. There are interfaces, and theoretically a metric group isexposed to user-code via the SplitEnumeratorContext, but the metricgroup is always null in the SourceCoordinatorContext. So we shouldn'thave to worry about backwards-compatibility.

> However, the config key "metrics.scope.jm.job" is deprecated and isreplaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" isreplaced with "metrics.scope.tm-job". That's why I though"metrics.scope.jm-operator" might be more consistent.

HA! Good catch. I only looked at the 1.16 docs where they weren'tdeprecated yet.


Then it should be "jm-operator" indeed.

On 31/01/2023 10:51, Jark Wu wrote:

Hi Chesnay,

> The new option must default to the configured value ofmetrics.scope.jm.job because otherwise it isn't backwards-compatiblewith existing coordinator metrics.IIUC, there is only an OperatorCoordinatorMetricGroup interfacewithout implementations (it never worked). There were no coordinatormetrics implemented and registered. @Hang Ruan<mailto:ruanhang1...@gmail.com> please correct me if I'm wrong.

> Intuitively I think the config key should bemetrics.scope.jm.operator., similar to metrics.scope.jm.job.However, the config key "metrics.scope.jm.job" is deprecated and isreplaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job"is replaced with "metrics.scope.tm-job". That's why I though"metrics.scope.jm-operator" might be more consistent.

> We would only support operator_name, operator_id, task_name andtask_id (+ everything else from the jm/job groups of course).

+1

Best,
Jark



On Mon, 30 Jan 2023 at 20:09, Chesnay Schepler <ches...@apache.org> wrote:

    The new option must default to the configured value of
    metrics.scope.jm.job because otherwise it isn't
    backwards-compatible with existing coordinator metrics.
    According to some earlier response in this thread there are
    already coordinator metrics.

    If that weren't the case you'd be correct, and we'd keep the
    default consistent with the default for tm operator metrics.

    Intuitively I think the config key should be
    metrics.scope.jm.operator., similar to metrics.scope.jm.job.
    For symmetry we should also add new metrics.scope.tm.operator/task
    and eventually phase out metrics.scope.operator.

    > Regarding "coordinator metrics don't *belong to* any
    tasks/vertexes", I mean the coordinator doesn't run on specific
    tasks, so it doesn't have runtime information of tasks.

    That is true. I intentionally only referred to task id/name, as
    things like subtask index or attempt ids don't make sense for
    coordinators as you said.
    We would only support operator_name, operator_id, task_name and
    task_id (+ everything else from the jm/job groups of course).
    (Maybe in the future attempt_num could be interesting though)

    On 30/01/2023 10:39, Jark Wu wrote:

    Hi Chesnay,

    IIUC, our only disagreement is whether to support task info in
    the metric format. But we all agree with the default metric
    format
    "metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>"
    which doesn't include task info, right?

    Regarding "coordinator metrics don't *belong to* any
    tasks/vertexes", I mean the coordinator doesn't run on specific
    tasks, so it doesn't have runtime information of tasks. My
    concern about adding task info to the coordinator metric is that
    1) if we register coordinator metrics for every subtask, the
    number of coordinator metrics will be exploded when task
    parallelism is large. This is an overhead to JM & metric systems,
    and the duplicated metrics are useless for users.
    2) Correct me if I'm wrong, MetricGroup#counter only supports
    registering one metric. It seems it can't register a metric
    reporting for multiple metric names (e.g. multiple subtasks).

    Best,
    Jark


    On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler
    <ches...@apache.org> wrote:

        I hadn't looked in detail in to how the task info can be
        introduced, but
        given that the coordinates are created by the execution
        graph, where we
        only work on tasks, it should be possible and rather simple.

        The OperatorCoordinatorHolder gets access to the
        ExecutionJobVertex from
        which it can extract everything and create the task/operator
        groups.
        Exposure to the coordinator could happen via the coordinator
        context as
        originally planned.

         > As we know, coordinator metrics are just like
        JMJobMetricGroup, which
        don't belong to any tasks/vertexes.

        I don't think this is true, every coordinator is scoped to a
        specific
        operator, and every operator is associated with a particular
        vertex.

         From the javadocs:

        " A coordinator for runtime operators. The
        OperatorCoordinator runs on
        the master, associated with
          the job vertex of the operator. It communicates with
        operators via
        sending operator events."

        On 28/01/2023 09:15, Jark Wu wrote:
        > Hi all,
        >
        > IIUC, Chesnay means we should have a more general metric
        group for
        > **operator**
        > not for **coordinator** in JM, this would be useful to extend
        > other operator-specific
        > metrics in the future. That means the new scope format
        should be designed
        > for
        > the operator,
        > e.g.,
        metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
        > The coordinator metric is a subgroup (a constant
        "coordinator" suffix) of
        > the JMOperatorMG.
        >
        > I think this is a nice design. However, I have a question
        about adding task
        > id/name to this list.
        > How to get the task id/name when reporting a coordinator
        metric? As we
        > know, coordinator metrics
        > are just like JMJobMetricGroup, which don't belong to any
        tasks/vertexes.
        > Do you mean reporting
        > coordinator metrics for every task id under the operator?
        >
        > Best,
        > Jark
        >
        >
        > On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler
        <ches...@apache.org> wrote:
        >
        >>   > First, I do not understand why users have to configure
        the new scope
        >> format, which has a default value.
        >>
        >> If you don't use scope formats, sure. If you do use scope
        formats, e.g.
        >> to add a common prefix (which is the case for datadog
        users for
        >> example), then the current default in the FLIP is
        insufficient and
        >> requires the user to update the configuration.
        >>
        >>   > I usually do not need to change these configuration of
        scope formats
        >> when submitting the flink job.
        >>
        >> Scope formats as a whole are quite a power-user feature,
        but that
        >> doesn't mean we should ignore it.
        >>
        >>   > I try to let it extend the ComponentMetricGroup
        >>
        >> This isn't inherently required just because it is a
        "component".
        >> Component metric groups should _only_ be used for cases
        where such a
        >> component provides several bits of metadata.
        >> For example, tasks provide vertex ids, task names,
        attempted IDs etc.,
        >> and we'd like users to have the option on how this
        metadata ends up
        >> being used (via scope formats). This currently can't be
        built with the
        >> addGroup() methods, hence why the component groups exist.
        >> However, the operator coordinator _itself_does not provide
        several bits
        >> of metadata. Logically, the _operator_ does, not the
        coordinator.
        >>
        >>   > This make me decide to add a new scope format.
        >>
        >> And this is fine; just don't add a new scope format for
        the coordinator
        >> but logically for operators on the JM side, such that we
        can extend the
        >> set of operator-specific metrics in the future without
        having to add yet
        >> another scope format.
        >>
        >>   > The relationship between the Operator and
        OperatorCoordinator is
        >> maintained in the OperatorCoordinatorHolder. In the POC,
        the operator
        >> name/id could be found and the
        OperatorCoordinatorMetricGroup will be
        >> created here.
        >>
        >> That's all irrelevant.
        >>
        >>   > The registered metrics of OperatorCoordinator always
        are those which
        >> can not be aggregated from the tasks, like the number of
        unassigned
        >> splits in the SourceCoordinator. Actually we have not
        encountered the
        >> scenario you mentioned. I think the
        OperatorCoordinatorMetricGroup
        >> should only contain the metrics for itself instead of its
        subtasks.
        >>
        >> You are misunderstanding the problem.
        >>
        >> This isn't about aggregating metrics on our side or
        exposing metrics
        >> from TMs on the JM side or anything like that.
        >>
        >> It's purely about the available metadata for operator
        metrics on the JM
        >> side. Currently you suggest operator name / id; what I'm
        proposing is to
        >> add task(==vertex!) id / name to this list.
        >>
        >> The benefit here is simple. I can look up all metrics
        where task_id==XXX
        >> and get _everything_ related to that vertex, including all
        metrics
        >> associated with operators that are part of that vertex,
        including the
        >> coordinator.
        >>
        >>   > Using metrics.scope.jm.job is not enough to
        distinguish the different
        >> coordinators.
        >>
        >> I did not suggest just using metrics.scope.jm.job. I
        suggested that as
        >> the default for the jm operator scope format, that is used
        for the
        >> proposed JobManagerOperatorMetricGroup, which will have
        some plain
        >> metric groups as children for coordinators.
        >> Aka, you have the JMOperatorMG, then you call
        addGroup("coordinators")
        >> and then addGroup(coordinator_name) for each coordinator.
        >>
        >>   > So there are two choice for the
        >> InternalOperatorCoordinatorMetricGroup: 1. add and improve
        the new scope
        >> format; 2. use the metrics.scope.jm.job and
        ProxyMetricGroup. Which one
        >> is better?
        >>
        >> There are more options than that.
        >> I feel like there are a lot of misunderstandings in this
        discussion.
        >> Please let me know if I could clear things up. If not I
        can also provide
        >> a PoC based on your PoC if that can speed things up. It's
        not that
        >> different anyway.
        >>
        >> On 19/01/2023 07:39, Hang Ruan wrote:
        >>> Hi, chesnay,
        >>>
        >>> Thanks for your reply. I still have some doubts about the
        questions
        >>> you raised.
        >>>
        >>>      > Extending the set of ScopeFormats is problematic
        because it in
        >>>      practice
        >>>      it breaks the config if users actively rely on it,
        since there's now
        >>>      another key that they _must_ set for it to be
        >>>      consistent/compatible with
        >>>      their existing setup.
        >>>      Unfortunately due to how powerful scope formats are
        we can't derive a
        >>>      default value that matches their existing setup.
        >>>      Hence we should try to do this as rarely as possible.
        >>>
        >>>
        >>>      > This FLIP does not adhere to that since it
        proposes a dedicated
        >>>      format
        >>>      for coordinators; next time we want to expose
        operator-specific
        >>>      metrics
        >>>      (e.g., in the scheduler) we'd have to add another
        one to support it.
        >>>
        >>>
        >>> First, I do not understand why users have to configure
        the new scope
        >>> format, which has a default value. I usually do not need
        to change
        >>> these configuration of scope formats when submitting the
        flink job.
        >>> IMO, the OperatorCoordinator is a component running at
        the JobMaster.
        >>> I try to let it extend the ComponentMetricGroup, but I
        cannot find
        >>> a suitable scope. This make me decide to add a new scope
        format.
        >>>
        >>>      > Additionally, the configurable variables (operator
        name/id) are
        >>>      logically not attached to the coordinator, but
        operators, so to me it
        >>>      just doesn't make sense to structure it like this.
        >>>
        >>>
        >>> The relationship between the Operator and
        OperatorCoordinator is
        >>> maintained in the OperatorCoordinatorHolder. In the POC,
        the operator
        >>> name/id could be found and the
        OperatorCoordinatorMetricGroup will be
        >>> created here.
        >>>
        >>>      > Another thing I'm concerned about is that, because
        we don't include
        >>>      tasks in the hierarchy, users wishing to collect all
        metrics for a
        >>>      particular task (in this case ==vertex) now have to go
        >>>      significantly out
        >>>      of their way to get them, since they can no longer
        just filter by the
        >>>      task ID but have to be filter for _all_ operators
        that are part of
        >>>      the task.
        >>>
        >>>
        >>> The registered metrics of OperatorCoordinator always are
        those which
        >>> can not be aggregated from the tasks, like the number of
        unassigned
        >>> splits in the SourceCoordinator. Actually we have not
        encountered the
        >>> scenario you mentioned. I think the
        OperatorCoordinatorMetricGroup
        >>> should only contain the metrics for itself instead of its
        subtasks.
        >>>
        >>> Using metrics.scope.jm.job is not enough to distinguish
        the different
        >>> coordinators. The operator id/name is necessary. Then
        >>> the implementation should be like this. But users can not
        change the
        >>> scope in this way. This is also acceptable.
        >>> @Internal public class InternalOperatorCoordinatorMetricGroup
        >>>           extends ProxyMetricGroup<MetricGroup>
        >>>           implements OperatorCoordinatorMetricGroup {
        >>>       public InternalOperatorCoordinatorMetricGroup(
        >>>               JobManagerJobMetricGroup parent, OperatorID
        operatorID,
        >> String operatorName) {
        >>>  super(parent.addGroup(operatorID +
        >> operatorName).addGroup("coordinator")); }
        >>> }
        >>> So there are two choice for the
        >>> InternalOperatorCoordinatorMetricGroup: 1. add and
        improve the new
        >>> scope format; 2. use the metrics.scope.jm.job and
        ProxyMetricGroup.
        >>> Which one is better?
        >>>
        >>> Best,
        >>> Hang
        >>>
        >>>
        >>> Chesnay Schepler <ches...@apache.org> 于2023年1月18日周三
        17:03写道：
        >>>
        >>>      You're misunderstanding the problem.
        >>>
        >>>      Metric groups form a tree with each group providing
        certain metadata.
        >>>
        >>>      E.g., on the taskmanager we have a TM metric group
        that provides info
        >>>      about the TM, that has child task metric groups,
        that have child
        >>>      operator metric groups etc. The operator metric
        group again has
        >>>      children, where we are now mostly in the
        user/connector land, like a
        >>>      kafka metric group providing partition info or
        something.
        >>>
        >>>      What is being proposed is to create a coordinator
        metric group on
        >>>      the JM
        >>>      side that provides operator metadata.
        >>>
        >>>      A more appropriate structure is to create an
        operator group on the JM
        >>>      side that provides this info, with the coordinator
        metric group
        >>>      being a
        >>>      child.
        >>>
        >>>      On 17/01/2023 19:56, Steven Wu wrote:
        >>>      >> Additionally, the configurable variables
        (operator name/id) are
        >>>      > logically not attached to the coordinator, but
        operators, so to
        >>>      me it
        >>>      > just doesn't make sense to structure it like this.
        >>>      >
        >>>      > Chesnay, maybe we should clarify the terminology.
        To me,
        >>>      pperators (like
        >>>      > FLIP-27 source) can have two parts (coordinator and
        >>>      reader/subtask). I
        >>>      > think it is fine to include operator name/id for
        coordinator
        >>>      metrics.
        >>>      >
        >>>      > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
        >>>      <ches...@apache.org> wrote:
        >>>      >
        >>>      >> Slight correction: Using metrics.scope.jm.job as
        the default
        >>>      should be
        >>>      >> safe.
        >>>      >>
        >>>      >> On 16/01/2023 10:18, Chesnay Schepler wrote:
        >>>      >>> The proposed ScopeFormat is still problematic
        for a few reasons.
        >>>      >>>
        >>>      >>> Extending the set of ScopeFormats is problematic
        because it in
        >>>      >>> practice it breaks the config if users actively
        rely on it, since
        >>>      >>> there's now another key that they _must_ set for
        it to be
        >>>      >>> consistent/compatible with their existing setup.
        >>>      >>> Unfortunately due to how powerful scope formats
        are we can't
        >>>      derive a
        >>>      >>> default value that matches their existing setup.
        >>>      >>> Hence we should try to do this as rarely as
        possible.
        >>>      >>>
        >>>      >>> This FLIP does not adhere to that since it
        proposes a
        >>>      dedicated format
        >>>      >>> for coordinators; next time we want to expose
        operator-specific
        >>>      >>> metrics (e.g., in the scheduler) we'd have to
        add another one to
        >>>      >>> support it.
        >>>      >>>
        >>>      >>> Additionally, the configurable variables
        (operator name/id) are
        >>>      >>> logically not attached to the coordinator, but
        operators, so
        >>>      to me it
        >>>      >>> just doesn't make sense to structure it like this.
        >>>      >>>
        >>>      >>> Another thing I'm concerned about is that,
        because we don't
        >>>      include
        >>>      >>> tasks in the hierarchy, users wishing to collect
        all metrics for
        >> a
        >>>      >>> particular task (in this case ==vertex) now have
        to go
        >>>      significantly
        >>>      >>> out of their way to get them, since they can no
        longer just
        >>>      filter by
        >>>      >>> the task ID but have to be filter for _all_
        operators that are
        >>>      part of
        >>>      >>> the task.
        >>>      >>>
        >>>      >>> On 16/01/2023 03:09, Hang Ruan wrote:
        >>>      >>>> Hi, @ches...@apache.org <ches...@apache.org> ,
        >>>      >>>>
        >>>      >>>> Do you have time to help to review this FLIP
        again after the
        >>>      >>>> modification?
        >>>      >>>> Looking forward to your reply.
        >>>      >>>> This FLIP will add a new configuration for the
        >>>      >>>> OperatorCoordinatorMetricGroup scope format. It
        provides an
        >>>      internal
        >>>      >>>> implementation and is added as a component to the
        >>>      >>>> JobManagerJobMetricGroup.
        >>>      >>>> If something doesn't make sense, could you
        provide some
        >>>      advice? It
        >>>      >>>> will be
        >>>      >>>> very helpful. Thanks a lot for your help.
        >>>      >>>>
        >>>      >>>> Best,
        >>>      >>>> Hang
        >>>      >>>>
        >>>      >>>> Martijn Visser <martijnvis...@apache.org>
        于2023年1月11日周三
        >>>      16:34写道：
        >>>      >>>>
        >>>      >>>>
        >>>      >>>>> Hi Hang,
        >>>      >>>>>
        >>>      >>>>> I'm a bit surprised that this has gone to a
        vote, given that
        >>>      Chesnay
        >>>      >>>>> deliberately mentioned that he would vote
        against it as-is.
        >>>      I would
        >>>      >>>>> expect
        >>>      >>>>> that before going to a vote, he has had the
        opportunity to
        >>>      >>>>> participate in
        >>>      >>>>> this discussion.
        >>>      >>>>>
        >>>      >>>>> Best regards,
        >>>      >>>>>
        >>>      >>>>> Martijn
        >>>      >>>>>
        >>>      >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu
        <imj...@gmail.com>
        >>>      wrote:
        >>>      >>>>>
        >>>      >>>>>> Hi Dong,
        >>>      >>>>>>
        >>>      >>>>>> Regarding
        “SplitEnumeratorContext#metricGroup”, my only
        >>>      concern is
        >>>      >>>>>> that
        >>>      >>>>>> this is a core interface for the FLIP. It’s
        hard to tell how
        >>>      >>>>>> sources use
        >>>      >>>>>> metric group without mentioning this
        interface. Even if
        >>>      this is an
        >>>      >>>>> existing
        >>>      >>>>>> API, I think it’s worth introducing the
        interface again and
        >>>      declaring
        >>>      >>>>> that
        >>>      >>>>>> we will implement the interface instead of a
        no-op method
        >>>      in this
        >>>      >>>>>> FLIP.
        >>>      >>>>>>
        >>>      >>>>>> Anyway, this is a minor problem and shouldn’t
        block this
        >>>      FLIP. I’m
        >>>      >>>>>> +1 to
        >>>      >>>>>> start a vote.
        >>>      >>>>>>
        >>>      >>>>>> Best,
        >>>      >>>>>> Jark
        >>>      >>>>>>
        >>>      >>>>>>
        >>>      >>>>>>> 2023年1月3日 10:03，Hang Ruan
        <ruanhang1...@gmail.com> 写道：
        >>>      >>>>>>>
        >>>      >>>>>>> Hi, Jark and Dong,
        >>>      >>>>>>>
        >>>      >>>>>>> Thanks for your comments. Sorry for my late
        reply.
        >>>      >>>>>>>
        >>>      >>>>>>> For suggestion 1, I plan to implement the
        >>>      >>>>>>> SplitEnumeratorMetricGroup in
        >>>      >>>>>>> another issue, and it is not contained in
        this FLIP. I
        >>>      will add some
        >>>      >>>>>>> description about this part.
        >>>      >>>>>>> For suggestion 2, changes about
        >>>      OperatorCoordinator#metricGroup has
        >>>      >>>>>> already
        >>>      >>>>>>> been documented in the proposed change section.
        >>>      >>>>>>>
        >>>      >>>>>>> Best,
        >>>      >>>>>>> Hang
        >>>      >>>>>>>
        >>>      >>>>>>> Dong Lin <lindon...@gmail.com>
        于2023年1月1日周日 09:45写道：
        >>>      >>>>>>>
        >>>      >>>>>>>> Let me chime-in and add comments regarding
        the public
        >>>      interface
        >>>      >>>>> section.
        >>>      >>>>>>>> Please see my comments inline.
        >>>      >>>>>>>>
        >>>      >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
        >>>      <imj...@gmail.com> wrote:
        >>>      >>>>>>>>
        >>>      >>>>>>>>> Hi Hang,
        >>>      >>>>>>>>>
        >>>      >>>>>>>>> Thanks for driving this discussion. I
        think this is a
        >>>      very useful
        >>>      >>>>>> feature
        >>>      >>>>>>>>> for connectors.
        >>>      >>>>>>>>>
        >>>      >>>>>>>>> The FLIP looks quite good to me, and I
        just have two
        >>>      suggestions.
        >>>      >>>>>>>>>
        >>>      >>>>>>>>> 1. In the "Public Interface" section,
        mention that the
        >>>      >>>>>>>>> implementation
        >>>      >>>>>>>>> behavior of
        "SplitEnumeratorContext#metricGroup" is
        >>>      changed from
        >>>      >>>>>>>> returning
        >>>      >>>>>>>>> null to returning a concrete
        SplitEnumeratorMetricGroup
        >>>      instance.
        >>>      >>>>> Even
        >>>      >>>>>>>>> though the API is already there, the
        behavior change can
        >>>      also be
        >>>      >>>>>>>> considered
        >>>      >>>>>>>>> a public change.
        >>>      >>>>>>>>>
        >>>      >>>>>>>> SplitEnumeratorContext#metricGroup is an
        interface and
        >>>      this FLIP
        >>>      >>>>>>>> does
        >>>      >>>>>> not
        >>>      >>>>>>>> seem to change its semantics/behavior. The
        FLIP does
        >>>      change the
        >>>      >>>>>>>> implementation/behavior of
        >>> SourceCoordinatorContext#metricGroup,
        >>>      >>>>>>>> which
        >>>      >>>>>> is
        >>>      >>>>>>>> marked @Internal.
        >>>      >>>>>>>>
        >>>      >>>>>>>> Thus it might seem a bit weird to add in
        the public
        >> interface
        >>>      >>>>>>>> section
        >>>      >>>>>>>> saying that we change the interface
        >>>      >>>>>>>> SplitEnumeratorContext#metricGroup
        >>>      >>>>>> from
        >>>      >>>>>>>> returning null to non-null object.
        >>>      >>>>>>>>
        >>>      >>>>>>>> 2. Mention the newly added interface of
        >>>      >>>>>> "OperatorCoordinator#metricGroup"
        >>>      >>>>>>>>> in the "Proposed Changes" section or
        "Public Interface"
        >>>      section. As
        >>>      >>>>> the
        >>>      >>>>>>>>> FLIP said, OperatorCoordinator is widely
        used in many
        >>>      connectors.
        >>>      >>>>>> Though
        >>>      >>>>>>>> it
        >>>      >>>>>>>>> is still an @Internal API, I think it is worth
        >>>      mentioning the
        >>>      >>>>>>>>> change
        >>>      >>>>> in
        >>>      >>>>>>>> the
        >>>      >>>>>>>>> FLIP.
        >>>      >>>>>>>> Since OperatorCoordinator is an internal
        API, it seems
        >>>      reasonable to
        >>>      >>>>>>>> explain it in the proposed change section.
        The FLIP seems
        >>>      to have
        >>>      >>>>>>>> documented this in point 5 of the proposed
        change section.
        >>>      >>>>>>>>
        >>>      >>>>>>>> BTW, if we think there are @internal
        classes that are
        >>>      important
        >>>      >>>>>>>> enough
        >>>      >>>>>> to
        >>>      >>>>>>>> be added in the public interface section,
        it might be
        >>>      useful to
        >>>      >>>>>> explicitly
        >>>      >>>>>>>> discuss this topic and document it in the
        "*What are the
        >>>      "public
        >>>      >>>>>>>> interfaces" of the project*" in this
        >>>      >>>>>>>> <
        >>>      >>>>>>>>
        >>>      >>
        >>>
        >>
        
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
        >>>      >>>>>>>> wiki.
        >>>      >>>>>>>>
        >>>      >>>>>>>>
        >>>      >>>>>>>>> Best,
        >>>      >>>>>>>>> Jark
        >>>      >>>>>>>>>
        >>>      >>>>>>>>>
        >>>      >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
        >>>      <ruanhang1...@gmail.com>
        >>>      >>>>>> wrote:
        >>> >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and
        Qingsheng.
        >>> >>>>>>>>>> After combining everyone's comments, the main
        concerns and
        >>>      >>>>>>>> corresponding
        >>> >>>>>>>>>> adjustments are as follows.
        >>> >>>>>>>>>>
        >>> >>>>>>>>>> Q1: Common metrics are not quite useful.
        >>> >>>>>>>>>> numEventsIn and numEventsOut counters will be
        removed
        >>>      from the
        >>> >>>>>>>>>> OperatorCoordinatorMetricGroup. These common
        metrics do
        >> not
        >>> >>>>>>>>>> provide
        >>>      >>>>>>>>> enough
        >>> >>>>>>>>>> information for users. The users are more
        willing to
        >>>      get the
        >>> >>>>>>>>>> number
        >>>      >>>>> of
        >>> >>>>>>>>>> events of the specified type instead of the total
        >>>      number. And this
        >>>      >>>>>>>> metric
        >>> >>>>>>>>>> is calculated differently. The implementation
        could
        >>>      register the
        >>>      >>>>>> metric
        >>>      >>>>>>>>> by
        >>> >>>>>>>>>> themselves.
        >>> >>>>>>>>>>
        >>> >>>>>>>>>> Q2: This FLIP is overly complicated.
        >>> >>>>>>>>>> This FLIP will become concise after these
        modifications.
        >>> >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
        >>>      introduced into
        >>>      >>>>> Flink
        >>>      >>>>>>>> by
        >>> >>>>>>>>>> FLIP-179<
        >>> >>>>>>>>>>
        >>> >>>>>>>>>>
        >>>      >>
        >>>
        >>
        
https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
        >>> >>>>>>>>>>> .
        >>> >>>>>>>>>> And
        >>> >>>>>>>>>> this FLIP will not change it. This FLIP only
        provides a
        >>>      new metric
        >>>      >>>>>>>> option
        >>> >>>>>>>>>> and a new metric group scope. The changes in
        proposed
        >>>      changes
        >>>      >>>>> provide
        >>>      >>>>>>>> the
        >>> >>>>>>>>>> details about the modifications for the internal
        >>>      classes, which
        >>>      >>>>> might
        >>>      >>>>>>>>> make
        >>> >>>>>>>>>> it look complicated.
        >>> >>>>>>>>>>
        >>> >>>>>>>>>> Thanks for all the comments again. If there
        are no further
        >>> >>>>>>>>>> comments,
        >>>      >>>>>> we
        >>> >>>>>>>>>> plan to start the voting thread this week.
        >>> >>>>>>>>>>
        >>> >>>>>>>>>> Best,
        >>> >>>>>>>>>> Hang
        >>> >>>>>>>>>>
        >>> >>>>>>>>>> Qingsheng Ren <renqs...@gmail.com>
        于2022年12月26日周一
        >>>      16:48写道：
        >>> >>>>>>>>>>
        >>> >>>>>>>>>>
        >>> >>>>>>>>>>> Thanks for the FLIP, Hang!
        >>> >>>>>>>>>>>
        >>> >>>>>>>>>>> This FLIP overall looks good to me. Actually
        I share
        >>>      the same
        >>>      >>>>> concern
        >>> >>>>>>>>>> with
        >>> >>>>>>>>>>> Zhu that numEventsIn and numEventsOut
        counters are not
        >>>      quite
        >>> >>>>>>>>>>> useful
        >>>      >>>>>>>> to
        >>> >>>>>>>>>> end
        >>> >>>>>>>>>>> users. OperatorEvent is a quite low-level
        abstraction,
        >>>      which
        >>>      >>>>> requires
        >>> >>>>>>>>>>> instantialization in order to be practical to
        users and
        >>> >>>>>>>>>>> developers,
        >>>      >>>>>>>> so
        >>> >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
        >>> >>>>>>>>>>>
        >>> >>>>>>>>>>> Best,
        >>> >>>>>>>>>>> Qingsheng
        >>> >>>>>>>>>>> Ververica (Alibaba)
        >>> >>>>>>>>>>>
        >>> >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
        >>>      <reed...@gmail.com>
        >>>      >>>>> wrote:
        >>> >>>>>>>>>>>> Hi Hang,
        >>> >>>>>>>>>>>> I still see no strong reason why we need
        >>> >>>>>>>>>>>> numEventsIn/numEventsOut
        >>> >>>>>>>>>>> metrics.
        >>> >>>>>>>>>>>> In the discussion in FLINK-29801, I can see
        the same
        >>>      concern
        >>> >>>>>>>>>>>> from
        >>> >>>>>>>>>> others.
        >>> >>>>>>>>>>>> So I prefer to exclude them from this FLIP
        to avoid
        >>> >>>>>>>>>>>> over-extending
        >>>      >>>>>>>>> the
        >>> >>>>>>>>>>>> scope.
        >>> >>>>>>>>>>>>
        >>> >>>>>>>>>>>> Thanks,
        >>> >>>>>>>>>>>> Zhu
        >>> >>>>>>>>>>>>
        >>> >>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com>
        于2022年12月23日周五
        >>> >>>>>>>>>>>> 15:21写道：
        >>> >>>>>>>>>>>>> Hi everyone,
        >>> >>>>>>>>>>>>>
        >>> >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we
        should keep
        >>>      the common
        >>> >>>>>>>>>>> metrics,
        >>> >>>>>>>>>>>> which will help to observe incoming and outgoing
        >>>      events. What do
        >>>      >>>>>>>> you
        >>> >>>>>>>>>>> think,
        >>> >>>>>>>>>>>> @Zhu Zhu ?
        >>> >>>>>>>>>>>>> And @Chesnay, are there any other issues
        you are
        >>>      more concerned
        >>> >>>>>>>>>> about?
        >>> >>>>>>>>>>>> Looking forward to your reply.
        >>> >>>>>>>>>>>>> Thanks for all the comments. If there are
        no further
        >>> >>>>>>>>>>>>> comments, we
        >>> >>>>>>>>>> plan
        >>> >>>>>>>>>>>> to start the voting thread next week.
        >>> >>>>>>>>>>>>> Best,
        >>> >>>>>>>>>>>>> Hang
        >>> >>>>>>>>>>>>>
        >>> >>>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com>
        于2022年12月15日周四
        >>> >>>>>>>>>>>>> 16:49写道：
        >>> >>>>>>>>>>>>>> Hi, Zhu Zhu,
        >>> >>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>> Thanks for your feedback!
        >>> >>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>> The OperatorCoordinator implementations are
        >>>      different. And
        >>> >>>>>>>>>>>>>> their
        >>> >>>>>>>>>>>> metrics are much different too. We try to
        find the
        >> common
        >>> >>>>>>>>>>>> metrics
        >>>      >>>>>>>> and
        >>> >>>>>>>>>> put
        >>> >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup.
        If most
        >>>      developers
        >>>      >>>>>>>> think
        >>>      >>>>>>>>> we
        >>> >>>>>>>>>>> do
        >>> >>>>>>>>>>>> not need these common metrics, removing the
        common
        >>>      metrics is
        >>> >>>>>>>>>>>> also
        >>> >>>>>>>>>>>> acceptable.
        >>> >>>>>>>>>>>>>> Best,
        >>> >>>>>>>>>>>>>> Hang
        >>> >>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com>
        于2022年12月14日周三
        >>>      22:09写道：
        >>> >>>>>>>>>>>>>>> Hi Hang & MengYue,
        >>> >>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>> Thanks for creating this FLIP!
        >>> >>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>> I think it is very useful, mainly in two
        aspects:
        >>> >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to
        register metrics.
        >>> >>>>>>>>>>>>>>> Currently
        >>> >>>>>>>>>>>>>>> the coordinators has no way to do this.
        And operator
        >>>      >>>>>>>> coordinator
        >>> >>>>>>>>>>>>>>> metric group further enables the
        SplitEnumerator
        >>>      to have
        >>> >>>>>>>>>>>>>>> access
        >>> >>>>>>>>>>>>>>> to a registered metric group (via the
        existing public
        >>> >>>>>>>>>>>>>>> interface
        >>> >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()),
        which is
        >>>      null at the
        >>>      >>>>>>>>> moment.
        >>> >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
        >>>      metrics. A clear
        >>> >>>>>>>>>>>> definition
        >>> >>>>>>>>>>>>>>> makes it easy for users to find their wanted
        >>>      metrics. The
        >>> >>>>>>>>>> definition
        >>> >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
        >> multiple
        >>> >>>>>>>>>>>> OperatorCoordinators
        >>> >>>>>>>>>>>>>>> of the same kind. E.g. each
        SourceCoordinator may
        >>>      have its
        >>> >>>>>>>>>>>>>>> own
        >>> >>>>>>>>>>>>>>> numSourceSplits metric, these metrics
        should not
        >>>      be directly
        >>> >>>>>>>>>>> registered
        >>> >>>>>>>>>>>>>>> to the job metric group.
        >>> >>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity
        of the
        >>>      introduced
        >>>      >>>>>>>>> common
        >>> >>>>>>>>>>>> metrics
        >>> >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter.
        If there
        >>>      any case
        >>>      >>>>>>>> which
        >>> >>>>>>>>>>>> strongly
        >>> >>>>>>>>>>>>>>> requires them?
        >>> >>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
        >>> >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
        >> overkill
        >>> >>>>>>>>>>>>>>> Directly using the job metric group can
        result in
        >>>      metric
        >>>      >>>>>>>>> conflicts,
        >>> >>>>>>>>>>> as
        >>> >>>>>>>>>>>> mentioned
        >>> >>>>>>>>>>>>>>> in above #2.
        >>> >>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>> Thanks,
        >>> >>>>>>>>>>>>>>> Zhu
        >>> >>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>> Dong Lin <lindon...@gmail.com>
        于2022年12月10日周六
        >>> >>>>>>>>>>>>>>> 14:16写道：
        >>> >>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>> Hi Chesney,
        >>> >>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>> Just to double check with you,
        >>> >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
        >>> >>>>>>>>>>>> (annotated as
        >>> >>>>>>>>>>>>>>>> @PublicEvolving) has already been
        introduced into
        >>>      Flink by
        >>> >>>>>>>>>> FLIP-179
        >>> >>>>>>>>>>>>>>>> <
        >>>      >>
        >>>
        >>
        
https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
        >>> >>>>>>>>>>>>> .
        >>> >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you
        mean we
        >>>      should remove
        >>>      >>>>>>>>> this
        >>> >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
        >>> >>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>> Regards,
        >>> >>>>>>>>>>>>>>>> Dong
        >>> >>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay
        Schepler <
        >>> >>>>>>>>>>> ches...@apache.org>
        >>> >>>>>>>>>>>> wrote:
        >>> >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
        >>>      complicated. A
        >>> >>>>>>>>>>> dedicated
        >>> >>>>>>>>>>>>>>>>> coordinator MG implementation is
        overkill; it
        >>>      could just
        >>>      >>>>>>>>> re-use
        >>> >>>>>>>>>>> the
        >>> >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the
        same
        >>>      structure we
        >>>      >>>>>>>>> have
        >>> >>>>>>>>>> on
        >>> >>>>>>>>>>>> TMs,
        >>> >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
        >>> >>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>> However, I'm not convinced that this is
        required
        >>>      anyway,
        >>> >>>>>>>>>> because
        >>> >>>>>>>>>>>> all the
        >>> >>>>>>>>>>>>>>>>> example metrics you listed can be
        implemented on
        >>>      the TM
        >>>      >>>>>>>> side
        >>>      >>>>>>>>> +
        >>> >>>>>>>>>>>>>>>>> aggregating them in the external
        metrics backend.
        >>> >>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no
        one tries
        >>>      to pull a
        >>>      >>>>>>>>> fast
        >>> >>>>>>>>>>>> one on
        >>> >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is
        I'd be
        >>>      against it.
        >>> >>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
        >>> >>>>>>>>>>>>>>>>>> Hi Hang,
        >>> >>>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks
        good and it
        >>>      is pretty
        >>> >>>>>>>>>>>> informative.
        >>> >>>>>>>>>>>>>>>>>> I have just two minor comments
        regarding names:
        >>> >>>>>>>>>>>>>>>>>> - Would it be useful to rename the
        config key as
        >>> >>>>>>>>>>>>>>>>>>
        *metrics.scope.jm.job.operator-coordinator* for
        >>>      >>>>>>>> consistency
        >>> >>>>>>>>>>> with
        >>> >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
        >>> >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
        >>> >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
        >>> >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
        >>> >>>>>>>>>>>> for
        >>> >>>>>>>>>>>>>>>>>> simplicity and consistency with
        >>>      SCOPE_NAMING_OPERATOR
        >>>      >>>>>>>>> (which
        >>> >>>>>>>>>> is
        >>> >>>>>>>>>>>> not named
        >>> >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
        >>> >>>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>>> Cheers,
        >>> >>>>>>>>>>>>>>>>>> Dong
        >>> >>>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
        >>> >>>>>>>>>>> ruanhang1...@gmail.com>
        >>> >>>>>>>>>>>> wrote:
        >>> >>>>>>>>>>>>>>>>>>> Hi all,
        >>> >>>>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1]
        Introduce
        >>>      metric group
        >>>      >>>>>>>>> for
        >>> >>>>>>>>>>>>>>>>>>> OperatorCoordinator.
        OperatorCoordinator is the
        >>>      >>>>>>>>> coordinator
        >>> >>>>>>>>>>> for
        >>> >>>>>>>>>>>> runtime
        >>> >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
        >>>      coordination
        >>> >>>>>>>>>>>> mechanism is
        >>> >>>>>>>>>>>>>>>>>>> operator events between
        OperatorCoordinator
        >>>      and its all
        >>> >>>>>>>>>>>> operators, the
        >>> >>>>>>>>>>>>>>>>>>> coordination is more and more using
        in Flink, for
        >>>      >>>>>>>> example
        >>> >>>>>>>>>> many
        >>> >>>>>>>>>>>> Sources
        >>> >>>>>>>>>>>>>>>>> and
        >>> >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to
        assign splits
        >> and
        >>> >>>>>>>>>> coordinate
        >>> >>>>>>>>>>>> commits to
        >>> >>>>>>>>>>>>>>>>>>> external systems. The
        OperatorCoordinator is
        >>>      widely
        >>>      >>>>>>>> using
        >>>      >>>>>>>>> in
        >>> >>>>>>>>>>>> flink kafka
        >>> >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector,
        flink cdc
        >>>      connector,
        >>> >>>>>>>>>> flink
        >>> >>>>>>>>>>>> hudi
        >>> >>>>>>>>>>>>>>>>>>> connector and so on.
        >>> >>>>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>>>> But there is not a suitable metric
        group scope
        >>>      for the
        >>> >>>>>>>>>>>>>>>>> OperatorCoordinator
        >>> >>>>>>>>>>>>>>>>>>> and not an implementation for the
        interface
        >>> >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
        >>> >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator
        could be
        >>>      how many
        >>> >>>>>>>>>>>> splits/partitions
        >>> >>>>>>>>>>>>>>>>>>> have been assigned to source readers,
        how many
        >>>      files
        >>>      >>>>>>>> have
        >>> >>>>>>>>>> been
        >>> >>>>>>>>>>>> written
        >>> >>>>>>>>>>>>>>>>> out
        >>> >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not
        only help
        >>>      users to
        >>>      >>>>>>>> know
        >>> >>>>>>>>>> the
        >>> >>>>>>>>>>>> job
        >>> >>>>>>>>>>>>>>>>> progress
        >>> >>>>>>>>>>>>>>>>>>> but also make big job maintaining
        easier. Thus we
        >>>      >>>>>>>> propose
        >>> >>>>>>>>>> the
        >>> >>>>>>>>>>>> FLIP-274
        >>> >>>>>>>>>>>>>>>>> to
        >>> >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
        >>>      >>>>>>>> OperatorCoordinator
        >>> >>>>>>>>>> and
        >>> >>>>>>>>>>>> provide
        >>> >>>>>>>>>>>>>>>>> an
        >>> >>>>>>>>>>>>>>>>>>> internal implementation for
        >>>      >>>>>>>>> OperatorCoordinatorMetricGroup.
        >>> >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when
        you get
        >>>      time? Any
        >>> >>>>>>>>>>> feedback
        >>> >>>>>>>>>>>> is
        >>> >>>>>>>>>>>>>>>>>>> appreciated!
        >>> >>>>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>>>> Best,
        >>> >>>>>>>>>>>>>>>>>>> Hang
        >>> >>>>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>>>> [1]
        >>> >>>>>>>>>>>>>>>>>>>
        >>> >>>>>>>>>>>>>>>>>>>
        >>>      >>
        >>>
        >>
        
https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
        >>>      >>
        >>>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Reply via email to