Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Chesnay Schepler Thu, 02 Feb 2023 02:44:47 -0800

Looks good from my side.

On 01/02/2023 13:27, Hang Ruan wrote:

Hi all,


I make a mistake about the default value of the  scope format
JobManagerOperatorScopeFormat. It should be
<host>.jobmanager.<job_name>.<operator_name>.
I have fixed it in the FLIP.
And the variables (task id and task name) have been added to the list.

Best,
Hang


Hang Ruan <ruanhang1...@gmail.com> 于2023年2月1日周三 14:11写道：

Hi all,

IIUC, there is only an OperatorCoordinatorMetricGroup interface without

implementations (it never worked). There were no coordinator metrics
implemented and registered. @Hang Ruan please correct me if I'm wrong.
Yes,  we shouldn't have to worry about backwards-compatibility as Chesnay
said.

After the discussion, I think the FLIP should be modified as follow.
1.  Add new metric group JobManagerOperatorMetricGroup, whose scope
format is JobManagerOperatorScopeFormat. The config key of 
JobManagerOperatorScopeFormat
is metrics.scope.jm-operator, and the default value is
<host>.jobmanager.<job_name>.<task_name>.<operator_name>.
2. JobManagerOperatorMetricGroup will be sub components of
JobManagerJobMetricGroup.
3. InternalOperatorCoordinatorMetricGroup will extend ProxyMetricGroup,
which registered under JobManagerOperatorMetricGroup by
addGroup("coordinator").

I have updated the FLIP-274. @Chesnay Schepler <ches...@apache.org> @Jark
Wu <imj...@gmail.com> , please help to take a look at the modified FLIP
and make sure that I do not miss  something.

If there are no more problems, we plan to start the voting thread later.

Best,
Hang

Chesnay Schepler <ches...@apache.org> 于2023年1月31日周二 20:09写道：

IIUC, there is only an OperatorCoordinatorMetricGroup interface without

implementations (it never worked). There were no coordinator metrics
implemented and registered. @Hang Ruan <ruanhang1...@gmail.com> please
correct me if I'm wrong.

Indeed. There are interfaces, and theoretically a metric group is exposed
to user-code via the SplitEnumeratorContext, but the metric group is always
null in the SourceCoordinatorContext. So we shouldn't have to worry about
backwards-compatibility.

However, the config key "metrics.scope.jm.job" is deprecated and is

replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is
replaced with "metrics.scope.tm-job". That's why I though
"metrics.scope.jm-operator" might be more consistent.

HA! Good catch. I only looked at the 1.16 docs where they weren't
deprecated yet.

Then it should be "jm-operator" indeed.

On 31/01/2023 10:51, Jark Wu wrote:

Hi Chesnay,

The new option must default to the configured value of

metrics.scope.jm.job because otherwise it isn't backwards-compatible with
existing coordinator metrics.
IIUC, there is only an OperatorCoordinatorMetricGroup interface without
implementations (it never worked). There were no coordinator metrics
implemented and registered. @Hang Ruan <ruanhang1...@gmail.com> please
correct me if I'm wrong.

Intuitively I think the config key should be

metrics.scope.jm.operator., similar to metrics.scope.jm.job.
However, the config key "metrics.scope.jm.job" is deprecated and is
replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is
replaced with "metrics.scope.tm-job". That's why I though
"metrics.scope.jm-operator" might be more consistent.

We would only support operator_name, operator_id, task_name and task_id

(+ everything else from the jm/job groups of course).
+1

Best,
Jark



On Mon, 30 Jan 2023 at 20:09, Chesnay Schepler <ches...@apache.org>
wrote:

The new option must default to the configured value of
metrics.scope.jm.job because otherwise it isn't backwards-compatible with
existing coordinator metrics.
According to some earlier response in this thread there are already
coordinator metrics.

If that weren't the case you'd be correct, and we'd keep the default
consistent with the default for tm operator metrics.

Intuitively I think the config key should be metrics.scope.jm.operator.,
similar to metrics.scope.jm.job.
For symmetry we should also add new metrics.scope.tm.operator/task and
eventually phase out metrics.scope.operator.

Regarding "coordinator metrics don't *belong to* any tasks/vertexes",

I mean the coordinator doesn't run on specific tasks, so it doesn't have
runtime information of tasks.

That is true. I intentionally only referred to task id/name, as things
like subtask index or attempt ids don't make sense for coordinators as you
said.
We would only support operator_name, operator_id, task_name and task_id
(+ everything else from the jm/job groups of course).
(Maybe in the future attempt_num could be interesting though)

On 30/01/2023 10:39, Jark Wu wrote:

Hi Chesnay,

IIUC, our only disagreement is whether to support task info in the
metric format. But we all agree with the default metric format
"metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>"
which doesn't include task info, right?

Regarding "coordinator metrics don't *belong to* any tasks/vertexes", I
mean the coordinator doesn't run on specific tasks, so it doesn't have
runtime information of tasks. My concern about adding task info to the
coordinator metric is that
1) if we register coordinator metrics for every subtask, the number of
coordinator metrics will be exploded when task parallelism is large. This
is an overhead to JM & metric systems, and the duplicated metrics are
useless for users.
2) Correct me if I'm wrong, MetricGroup#counter only supports
registering one metric. It seems it can't register a metric reporting for
multiple metric names (e.g. multiple subtasks).

Best,
Jark


On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler <ches...@apache.org>
wrote:

I hadn't looked in detail in to how the task info can be introduced,
but
given that the coordinates are created by the execution graph, where we
only work on tasks, it should be possible and rather simple.

The OperatorCoordinatorHolder gets access to the ExecutionJobVertex
from
which it can extract everything and create the task/operator groups.
Exposure to the coordinator could happen via the coordinator context as
originally planned.

  > As we know, coordinator metrics are just like JMJobMetricGroup,
which
don't belong to any tasks/vertexes.

I don't think this is true, every coordinator is scoped to a specific
operator, and every operator is associated with a particular vertex.

  From the javadocs:

" A coordinator for runtime operators. The OperatorCoordinator runs on
the master, associated with
   the job vertex of the operator. It communicates with operators via
sending operator events."

On 28/01/2023 09:15, Jark Wu wrote:

Hi all,

IIUC, Chesnay means we should have a more general metric group for
**operator**
not for **coordinator** in JM, this would be useful to extend
other operator-specific
metrics in the future. That means the new scope format should be

designed

for
the operator,
e.g.,

metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>

The coordinator metric is a subgroup (a constant "coordinator"

suffix) of

the JMOperatorMG.

I think this is a nice design. However, I have a question about

adding task

id/name to this list.
How to get the task id/name when reporting a coordinator metric? As we
know, coordinator metrics
are just like JMJobMetricGroup, which don't belong to any

tasks/vertexes.

Do you mean reporting
coordinator metrics for every task id under the operator?

Best,
Jark


On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ches...@apache.org>

wrote:

   > First, I do not understand why users have to configure the new

scope

format, which has a default value.

If you don't use scope formats, sure. If you do use scope formats,

e.g.

to add a common prefix (which is the case for datadog users for
example), then the current default in the FLIP is insufficient and
requires the user to update the configuration.

   > I usually do not need to change these configuration of scope

formats

when submitting the flink job.

Scope formats as a whole are quite a power-user feature, but that
doesn't mean we should ignore it.

   > I try to let it extend the ComponentMetricGroup

This isn't inherently required just because it is a "component".
Component metric groups should _only_ be used for cases where such a
component provides several bits of metadata.
For example, tasks provide vertex ids, task names, attempted IDs

etc.,

and we'd like users to have the option on how this metadata ends up
being used (via scope formats). This currently can't be built with

the

addGroup() methods, hence why the component groups exist.
However, the operator coordinator _itself_does not provide several

bits

of metadata. Logically, the _operator_ does, not the coordinator.

   > This make me decide to add a new scope format.

And this is fine; just don't add a new scope format for the

coordinator

but logically for operators on the JM side, such that we can extend

the

set of operator-specific metrics in the future without having to add

yet

another scope format.

   > The relationship between the Operator and OperatorCoordinator is
maintained in the OperatorCoordinatorHolder. In the POC, the operator
name/id could be found and the OperatorCoordinatorMetricGroup will be
created here.

That's all irrelevant.

   > The registered metrics of OperatorCoordinator always are those

which

can not be aggregated from the tasks, like the number of unassigned
splits in the SourceCoordinator. Actually we have not encountered the
scenario you mentioned. I think the OperatorCoordinatorMetricGroup
should only contain the metrics for itself instead of its subtasks.

You are misunderstanding the problem.

This isn't about aggregating metrics on our side or exposing metrics
from TMs on the JM side or anything like that.

It's purely about the available metadata for operator metrics on the

JM

side. Currently you suggest operator name / id; what I'm proposing

is to

add task(==vertex!) id / name to this list.

The benefit here is simple. I can look up all metrics where

task_id==XXX

and get _everything_ related to that vertex, including all metrics
associated with operators that are part of that vertex, including the
coordinator.

   > Using metrics.scope.jm.job is not enough to distinguish the

different

coordinators.

I did not suggest just using metrics.scope.jm.job. I suggested that

as

the default for the jm operator scope format, that is used for the
proposed JobManagerOperatorMetricGroup, which will have some plain
metric groups as children for coordinators.
Aka, you have the JMOperatorMG, then you call

addGroup("coordinators")

and then addGroup(coordinator_name) for each coordinator.

   > So there are two choice for the
InternalOperatorCoordinatorMetricGroup: 1. add and improve the new

scope

format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which

one

is better?

There are more options than that.
I feel like there are a lot of misunderstandings in this discussion.
Please let me know if I could clear things up. If not I can also

provide

a PoC based on your PoC if that can speed things up. It's not that
different anyway.

On 19/01/2023 07:39, Hang Ruan wrote:

Hi, chesnay,

Thanks for your reply. I still have some doubts about the questions
you raised.

      > Extending the set of ScopeFormats is problematic because it

in

      practice
      it breaks the config if users actively rely on it, since

there's now

      another key that they _must_ set for it to be
      consistent/compatible with
      their existing setup.
      Unfortunately due to how powerful scope formats are we can't

derive a

      default value that matches their existing setup.
      Hence we should try to do this as rarely as possible.


      > This FLIP does not adhere to that since it proposes a

dedicated

      format
      for coordinators; next time we want to expose operator-specific
      metrics
      (e.g., in the scheduler) we'd have to add another one to

support it.


First, I do not understand why users have to configure the new scope
format, which has a default value. I usually do not need to change
these configuration of scope formats when submitting the flink job.
IMO, the OperatorCoordinator is a component running at the

JobMaster.

I try to let it extend the ComponentMetricGroup, but I cannot find
a suitable scope. This make me decide to add a new scope format.

      > Additionally, the configurable variables (operator name/id)

are

      logically not attached to the coordinator, but operators, so

to me it

      just doesn't make sense to structure it like this.


The relationship between the Operator and OperatorCoordinator is
maintained in the OperatorCoordinatorHolder. In the POC, the

operator

name/id could be found and the OperatorCoordinatorMetricGroup will

be

created here.

      > Another thing I'm concerned about is that, because we don't

include

      tasks in the hierarchy, users wishing to collect all metrics

for a

      particular task (in this case ==vertex) now have to go
      significantly out
      of their way to get them, since they can no longer just filter

by the

      task ID but have to be filter for _all_ operators that are

part of

      the task.


The registered metrics of OperatorCoordinator always are those which
can not be aggregated from the tasks, like the number of unassigned
splits in the SourceCoordinator. Actually we have not encountered

the

scenario you mentioned. I think the OperatorCoordinatorMetricGroup
should only contain the metrics for itself instead of its subtasks.

Using metrics.scope.jm.job is not enough to distinguish the

different

coordinators. The operator id/name is necessary. Then
the implementation should be like this. But users can not change the
scope in this way. This is also acceptable.
@Internal public class InternalOperatorCoordinatorMetricGroup
           extends ProxyMetricGroup<MetricGroup>
           implements OperatorCoordinatorMetricGroup {
       public InternalOperatorCoordinatorMetricGroup(
               JobManagerJobMetricGroup parent, OperatorID

operatorID,

String operatorName) {

           super(parent.addGroup(operatorID +

operatorName).addGroup("coordinator")); }

}
So there are two choice for the
InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
Which one is better?

Best,
Hang


Chesnay Schepler <ches...@apache.org> 于2023年1月18日周三 17:03写道：

      You're misunderstanding the problem.

      Metric groups form a tree with each group providing certain

metadata.

      E.g., on the taskmanager we have a TM metric group that

provides info

      about the TM, that has child task metric groups, that have

child

      operator metric groups etc. The operator metric group again has
      children, where we are now mostly in the user/connector land,

like a

      kafka metric group providing partition info or something.

      What is being proposed is to create a coordinator metric group

on

      the JM
      side that provides operator metadata.

      A more appropriate structure is to create an operator group on

the JM

      side that provides this info, with the coordinator metric group
      being a
      child.

      On 17/01/2023 19:56, Steven Wu wrote:
      >> Additionally, the configurable variables (operator name/id)

are

      > logically not attached to the coordinator, but operators, so

to

      me it
      > just doesn't make sense to structure it like this.
      >
      > Chesnay, maybe we should clarify the terminology. To me,
      pperators (like
      > FLIP-27 source) can have two parts (coordinator and
      reader/subtask). I
      > think it is fine to include operator name/id for coordinator
      metrics.
      >
      > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
      <ches...@apache.org> wrote:
      >
      >> Slight correction: Using metrics.scope.jm.job as the default
      should be
      >> safe.
      >>
      >> On 16/01/2023 10:18, Chesnay Schepler wrote:
      >>> The proposed ScopeFormat is still problematic for a few

reasons.

      >>>
      >>> Extending the set of ScopeFormats is problematic because

it in

      >>> practice it breaks the config if users actively rely on

it, since

      >>> there's now another key that they _must_ set for it to be
      >>> consistent/compatible with their existing setup.
      >>> Unfortunately due to how powerful scope formats are we

can't

      derive a
      >>> default value that matches their existing setup.
      >>> Hence we should try to do this as rarely as possible.
      >>>
      >>> This FLIP does not adhere to that since it proposes a
      dedicated format
      >>> for coordinators; next time we want to expose

operator-specific

      >>> metrics (e.g., in the scheduler) we'd have to add another

one to

      >>> support it.
      >>>
      >>> Additionally, the configurable variables (operator

name/id) are

      >>> logically not attached to the coordinator, but operators,

so

      to me it
      >>> just doesn't make sense to structure it like this.
      >>>
      >>> Another thing I'm concerned about is that, because we don't
      include
      >>> tasks in the hierarchy, users wishing to collect all

metrics for

      >>> particular task (in this case ==vertex) now have to go
      significantly
      >>> out of their way to get them, since they can no longer just
      filter by
      >>> the task ID but have to be filter for _all_ operators that

are

      part of
      >>> the task.
      >>>
      >>> On 16/01/2023 03:09, Hang Ruan wrote:
      >>>> Hi, @ches...@apache.org <ches...@apache.org> ,
      >>>>
      >>>> Do you have time to help to review this FLIP again after

the

      >>>> modification?
      >>>> Looking forward to your reply.
      >>>> This FLIP will add a new configuration for the
      >>>> OperatorCoordinatorMetricGroup scope format. It provides

an

      internal
      >>>> implementation and is added as a component to the
      >>>> JobManagerJobMetricGroup.
      >>>> If something doesn't make sense, could you provide some
      advice? It
      >>>> will be
      >>>> very helpful. Thanks a lot for your help.
      >>>>
      >>>> Best,
      >>>> Hang
      >>>>
      >>>> Martijn Visser <martijnvis...@apache.org> 于2023年1月11日周三
      16:34写道：
      >>>>
      >>>>
      >>>>> Hi Hang,
      >>>>>
      >>>>> I'm a bit surprised that this has gone to a vote, given

that

      Chesnay
      >>>>> deliberately mentioned that he would vote against it

as-is.

      I would
      >>>>> expect
      >>>>> that before going to a vote, he has had the opportunity

to

      >>>>> participate in
      >>>>> this discussion.
      >>>>>
      >>>>> Best regards,
      >>>>>
      >>>>> Martijn
      >>>>>
      >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <

imj...@gmail.com>

      wrote:
      >>>>>
      >>>>>> Hi Dong,
      >>>>>>
      >>>>>> Regarding “SplitEnumeratorContext#metricGroup���, my only
      concern is
      >>>>>> that
      >>>>>> this is a core interface for the FLIP. It’s hard to

tell how

      >>>>>> sources use
      >>>>>> metric group without mentioning this interface. Even if
      this is an
      >>>>> existing
      >>>>>> API, I think it’s worth introducing the interface again

and

      declaring
      >>>>> that
      >>>>>> we will implement the interface instead of a no-op

method

      in this
      >>>>>> FLIP.
      >>>>>>
      >>>>>> Anyway, this is a minor problem and shouldn’t block this
      FLIP. I’m
      >>>>>> +1 to
      >>>>>> start a vote.
      >>>>>>
      >>>>>> Best,
      >>>>>> Jark
      >>>>>>
      >>>>>>
      >>>>>>> 2023年1月3日 10:03，Hang Ruan <ruanhang1...@gmail.com> 写道：
      >>>>>>>
      >>>>>>> Hi, Jark and Dong,
      >>>>>>>
      >>>>>>> Thanks for your comments. Sorry for my late reply.
      >>>>>>>
      >>>>>>> For suggestion 1, I plan to implement the
      >>>>>>> SplitEnumeratorMetricGroup in
      >>>>>>> another issue, and it is not contained in this FLIP. I
      will add some
      >>>>>>> description about this part.
      >>>>>>> For suggestion 2, changes about
      OperatorCoordinator#metricGroup has
      >>>>>> already
      >>>>>>> been documented in the proposed change section.
      >>>>>>>
      >>>>>>> Best,
      >>>>>>> Hang
      >>>>>>>
      >>>>>>> Dong Lin <lindon...@gmail.com> 于2023年1月1日周日 09:45写道：
      >>>>>>>
      >>>>>>>> Let me chime-in and add comments regarding the public
      interface
      >>>>> section.
      >>>>>>>> Please see my comments inline.
      >>>>>>>>
      >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
      <imj...@gmail.com> wrote:
      >>>>>>>>
      >>>>>>>>> Hi Hang,
      >>>>>>>>>
      >>>>>>>>> Thanks for driving this discussion. I think this is a
      very useful
      >>>>>> feature
      >>>>>>>>> for connectors.
      >>>>>>>>>
      >>>>>>>>> The FLIP looks quite good to me, and I just have two
      suggestions.
      >>>>>>>>>
      >>>>>>>>> 1. In the "Public Interface" section, mention that

the

      >>>>>>>>> implementation
      >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
      changed from
      >>>>>>>> returning
      >>>>>>>>> null to returning a concrete

SplitEnumeratorMetricGroup

      instance.
      >>>>> Even
      >>>>>>>>> though the API is already there, the behavior change

can

      also be
      >>>>>>>> considered
      >>>>>>>>> a public change.
      >>>>>>>>>
      >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
      this FLIP
      >>>>>>>> does
      >>>>>> not
      >>>>>>>> seem to change its semantics/behavior. The FLIP does
      change the
      >>>>>>>> implementation/behavior of
      SourceCoordinatorContext#metricGroup,
      >>>>>>>> which
      >>>>>> is
      >>>>>>>> marked @Internal.
      >>>>>>>>
      >>>>>>>> Thus it might seem a bit weird to add in the public

interface

      >>>>>>>> section
      >>>>>>>> saying that we change the interface
      >>>>>>>> SplitEnumeratorContext#metricGroup
      >>>>>> from
      >>>>>>>> returning null to non-null object.
      >>>>>>>>
      >>>>>>>> 2. Mention the newly added interface of
      >>>>>> "OperatorCoordinator#metricGroup"
      >>>>>>>>> in the "Proposed Changes" section or "Public

Interface"

      section. As
      >>>>> the
      >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
      connectors.
      >>>>>> Though
      >>>>>>>> it
      >>>>>>>>> is still an @Internal API, I think it is worth
      mentioning the
      >>>>>>>>> change
      >>>>> in
      >>>>>>>> the
      >>>>>>>>> FLIP.
      >>>>>>>> Since OperatorCoordinator is an internal API, it seems
      reasonable to
      >>>>>>>> explain it in the proposed change section. The FLIP

seems

      to have
      >>>>>>>> documented this in point 5 of the proposed change

section.

      >>>>>>>>
      >>>>>>>> BTW, if we think there are @internal classes that are
      important
      >>>>>>>> enough
      >>>>>> to
      >>>>>>>> be added in the public interface section, it might be
      useful to
      >>>>>> explicitly
      >>>>>>>> discuss this topic and document it in the "*What are

the

      "public
      >>>>>>>> interfaces" of the project*" in this
      >>>>>>>> <
      >>>>>>>>
      >>

https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals

      >>>>>>>> wiki.
      >>>>>>>>
      >>>>>>>>
      >>>>>>>>> Best,
      >>>>>>>>> Jark
      >>>>>>>>>
      >>>>>>>>>
      >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
      <ruanhang1...@gmail.com>
      >>>>>> wrote:
      >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
      >>>>>>>>>> After combining everyone's comments, the main

concerns and

      >>>>>>>> corresponding
      >>>>>>>>>> adjustments are as follows.
      >>>>>>>>>>
      >>>>>>>>>> Q1: Common metrics are not quite useful.
      >>>>>>>>>> numEventsIn and numEventsOut counters will be

removed

      from the
      >>>>>>>>>> OperatorCoordinatorMetricGroup. These common

metrics do

not

      >>>>>>>>>> provide
      >>>>>>>>> enough
      >>>>>>>>>> information for users. The users are more willing to
      get the
      >>>>>>>>>> number
      >>>>> of
      >>>>>>>>>> events of the specified type instead of the total
      number. And this
      >>>>>>>> metric
      >>>>>>>>>> is calculated differently. The implementation could
      register the
      >>>>>> metric
      >>>>>>>>> by
      >>>>>>>>>> themselves.
      >>>>>>>>>>
      >>>>>>>>>> Q2: This FLIP is overly complicated.
      >>>>>>>>>> This FLIP will become concise after these

modifications.

      >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
      introduced into
      >>>>> Flink
      >>>>>>>> by
      >>>>>>>>>> FLIP-179<
      >>>>>>>>>>
      >>>>>>>>>>
      >>

https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics

      >>>>>>>>>>> .
      >>>>>>>>>> And
      >>>>>>>>>> this FLIP will not change it. This FLIP only

provides a

      new metric
      >>>>>>>> option
      >>>>>>>>>> and a new metric group scope. The changes in

proposed

      changes
      >>>>> provide
      >>>>>>>> the
      >>>>>>>>>> details about the modifications for the internal
      classes, which
      >>>>> might
      >>>>>>>>> make
      >>>>>>>>>> it look complicated.
      >>>>>>>>>>
      >>>>>>>>>> Thanks for all the comments again. If there are no

further

      >>>>>>>>>> comments,
      >>>>>> we
      >>>>>>>>>> plan to start the voting thread this week.
      >>>>>>>>>>
      >>>>>>>>>> Best,
      >>>>>>>>>> Hang
      >>>>>>>>>>
      >>>>>>>>>> Qingsheng Ren <renqs...@gmail.com> 于2022年12月26日周一
      16:48写道：
      >>>>>>>>>>
      >>>>>>>>>>
      >>>>>>>>>>> Thanks for the FLIP, Hang!
      >>>>>>>>>>>
      >>>>>>>>>>> This FLIP overall looks good to me. Actually I

share

      the same
      >>>>> concern
      >>>>>>>>>> with
      >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are

not

      quite
      >>>>>>>>>>> useful
      >>>>>>>> to
      >>>>>>>>>> end
      >>>>>>>>>>> users. OperatorEvent is a quite low-level

abstraction,

      which
      >>>>> requires
      >>>>>>>>>>> instantialization in order to be practical to

users and

      >>>>>>>>>>> developers,
      >>>>>>>> so
      >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
      >>>>>>>>>>>
      >>>>>>>>>>> Best,
      >>>>>>>>>>> Qingsheng
      >>>>>>>>>>> Ververica (Alibaba)
      >>>>>>>>>>>
      >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
      <reed...@gmail.com>
      >>>>> wrote:
      >>>>>>>>>>>> Hi Hang,
      >>>>>>>>>>>> I still see no strong reason why we need
      >>>>>>>>>>>> numEventsIn/numEventsOut
      >>>>>>>>>>> metrics.
      >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the

same

      concern
      >>>>>>>>>>>> from
      >>>>>>>>>> others.
      >>>>>>>>>>>> So I prefer to exclude them from this FLIP to

avoid

      >>>>>>>>>>>> over-extending
      >>>>>>>>> the
      >>>>>>>>>>>> scope.
      >>>>>>>>>>>>
      >>>>>>>>>>>> Thanks,
      >>>>>>>>>>>> Zhu
      >>>>>>>>>>>>
      >>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月23日周五
      >>>>>>>>>>>> 15:21写道：
      >>>>>>>>>>>>> Hi everyone,
      >>>>>>>>>>>>>
      >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should

keep

      the common
      >>>>>>>>>>> metrics,
      >>>>>>>>>>>> which will help to observe incoming and outgoing
      events. What do
      >>>>>>>> you
      >>>>>>>>>>> think,
      >>>>>>>>>>>> @Zhu Zhu ?
      >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
      more concerned
      >>>>>>>>>> about?
      >>>>>>>>>>>> Looking forward to your reply.
      >>>>>>>>>>>>> Thanks for all the comments. If there are no

further

      >>>>>>>>>>>>> comments, we
      >>>>>>>>>> plan
      >>>>>>>>>>>> to start the voting thread next week.
      >>>>>>>>>>>>> Best,
      >>>>>>>>>>>>> Hang
      >>>>>>>>>>>>>
      >>>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com>

于2022年12月15日周四

      >>>>>>>>>>>>> 16:49写道：
      >>>>>>>>>>>>>> Hi, Zhu Zhu,
      >>>>>>>>>>>>>>
      >>>>>>>>>>>>>> Thanks for your feedback!
      >>>>>>>>>>>>>>
      >>>>>>>>>>>>>> The OperatorCoordinator implementations are
      different. And
      >>>>>>>>>>>>>> their
      >>>>>>>>>>>> metrics are much different too. We try to find the

common

      >>>>>>>>>>>> metrics
      >>>>>>>> and
      >>>>>>>>>> put
      >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If

most

      developers
      >>>>>>>> think
      >>>>>>>>> we
      >>>>>>>>>>> do
      >>>>>>>>>>>> not need these common metrics, removing the common
      metrics is
      >>>>>>>>>>>> also
      >>>>>>>>>>>> acceptable.
      >>>>>>>>>>>>>> Best,
      >>>>>>>>>>>>>> Hang
      >>>>>>>>>>>>>>
      >>>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com> 于2022年12月14日周三
      22:09写道：
      >>>>>>>>>>>>>>> Hi Hang & MengYue,
      >>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>> Thanks for creating this FLIP!
      >>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>> I think it is very useful, mainly in two

aspects:

      >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register

metrics.

      >>>>>>>>>>>>>>> Currently
      >>>>>>>>>>>>>>> the coordinators has no way to do this. And

operator

      >>>>>>>> coordinator
      >>>>>>>>>>>>>>> metric group further enables the

SplitEnumerator

      to have
      >>>>>>>>>>>>>>> access
      >>>>>>>>>>>>>>> to a registered metric group (via the existing

public

      >>>>>>>>>>>>>>> interface
      >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
      null at the
      >>>>>>>>> moment.
      >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
      metrics. A clear
      >>>>>>>>>>>> definition
      >>>>>>>>>>>>>>> makes it easy for users to find their wanted
      metrics. The
      >>>>>>>>>> definition
      >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from

multiple

      >>>>>>>>>>>> OperatorCoordinators
      >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator

may

      have its
      >>>>>>>>>>>>>>> own
      >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should

not

      be directly
      >>>>>>>>>>> registered
      >>>>>>>>>>>>>>> to the job metric group.
      >>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of

the

      introduced
      >>>>>>>>> common
      >>>>>>>>>>>> metrics
      >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If

there

      any case
      >>>>>>>> which
      >>>>>>>>>>>> strongly
      >>>>>>>>>>>>>>> requires them?
      >>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
      >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is

overkill

      >>>>>>>>>>>>>>> Directly using the job metric group can result

in

      metric
      >>>>>>>>> conflicts,
      >>>>>>>>>>> as
      >>>>>>>>>>>> mentioned
      >>>>>>>>>>>>>>> in above #2.
      >>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>> Thanks,
      >>>>>>>>>>>>>>> Zhu
      >>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>> Dong Lin <lindon...@gmail.com> 于2022年12月10日周六
      >>>>>>>>>>>>>>> 14:16写道：
      >>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>> Hi Chesney,
      >>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>> Just to double check with you,
      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
      >>>>>>>>>>>> (annotated as
      >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced

into

      Flink by
      >>>>>>>>>> FLIP-179
      >>>>>>>>>>>>>>>> <
      >>

https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics

      >>>>>>>>>>>>> .
      >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
      should remove
      >>>>>>>>> this
      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
      >>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>> Regards,
      >>>>>>>>>>>>>>>> Dong
      >>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay

Schepler <

      >>>>>>>>>>> ches...@apache.org>
      >>>>>>>>>>>> wrote:
      >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
      complicated. A
      >>>>>>>>>>> dedicated
      >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
      could just
      >>>>>>>>> re-use
      >>>>>>>>>>> the
      >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
      structure we
      >>>>>>>>> have
      >>>>>>>>>> on
      >>>>>>>>>>>> TMs,
      >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
      >>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>> However, I'm not convinced that this is

required

      anyway,
      >>>>>>>>>> because
      >>>>>>>>>>>> all the
      >>>>>>>>>>>>>>>>> example metrics you listed can be

implemented on

      the TM
      >>>>>>>> side
      >>>>>>>>> +
      >>>>>>>>>>>>>>>>> aggregating them in the external metrics

backend.

      >>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one

tries

      to pull a
      >>>>>>>>> fast
      >>>>>>>>>>>> one on
      >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
      against it.
      >>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
      >>>>>>>>>>>>>>>>>> Hi Hang,
      >>>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good

and it

      is pretty
      >>>>>>>>>>>> informative.
      >>>>>>>>>>>>>>>>>> I have just two minor comments regarding

names:

      >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config

key as

      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator*

for

      >>>>>>>> consistency
      >>>>>>>>>>> with
      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
      >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
      >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
      >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
      >>>>>>>>>>>> for
      >>>>>>>>>>>>>>>>>> simplicity and consistency with
      SCOPE_NAMING_OPERATOR
      >>>>>>>>> (which
      >>>>>>>>>> is
      >>>>>>>>>>>> not named
      >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
      >>>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>>> Cheers,
      >>>>>>>>>>>>>>>>>> Dong
      >>>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
      >>>>>>>>>>> ruanhang1...@gmail.com>
      >>>>>>>>>>>> wrote:
      >>>>>>>>>>>>>>>>>>> Hi all,
      >>>>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
      metric group
      >>>>>>>>> for
      >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator

is the

      >>>>>>>>> coordinator
      >>>>>>>>>>> for
      >>>>>>>>>>>> runtime
      >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
      coordination
      >>>>>>>>>>>> mechanism is
      >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
      and its all
      >>>>>>>>>>>> operators, the
      >>>>>>>>>>>>>>>>>>> coordination is more and more using in

Flink, for

      >>>>>>>> example
      >>>>>>>>>> many
      >>>>>>>>>>>> Sources
      >>>>>>>>>>>>>>>>> and
      >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign

splits

and

      >>>>>>>>>> coordinate
      >>>>>>>>>>>> commits to
      >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator

is

      widely
      >>>>>>>> using
      >>>>>>>>> in
      >>>>>>>>>>>> flink kafka
      >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink

cdc

      connector,
      >>>>>>>>>> flink
      >>>>>>>>>>>> hudi
      >>>>>>>>>>>>>>>>>>> connector and so on.
      >>>>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group

scope

      for the
      >>>>>>>>>>>>>>>>> OperatorCoordinator
      >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
      >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
      >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could

be

      how many
      >>>>>>>>>>>> splits/partitions
      >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how

many

      files
      >>>>>>>> have
      >>>>>>>>>> been
      >>>>>>>>>>>> written
      >>>>>>>>>>>>>>>>> out
      >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only

help

      users to
      >>>>>>>> know
      >>>>>>>>>> the
      >>>>>>>>>>>> job
      >>>>>>>>>>>>>>>>> progress
      >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier.

Thus we

      >>>>>>>> propose
      >>>>>>>>>> the
      >>>>>>>>>>>> FLIP-274
      >>>>>>>>>>>>>>>>> to
      >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
      >>>>>>>> OperatorCoordinator
      >>>>>>>>>> and
      >>>>>>>>>>>> provide
      >>>>>>>>>>>>>>>>> an
      >>>>>>>>>>>>>>>>>>> internal implementation for
      >>>>>>>>> OperatorCoordinatorMetricGroup.
      >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you

get

      time? Any
      >>>>>>>>>>> feedback
      >>>>>>>>>>>> is
      >>>>>>>>>>>>>>>>>>> appreciated!
      >>>>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>>>> Best,
      >>>>>>>>>>>>>>>>>>> Hang
      >>>>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>>>> [1]
      >>>>>>>>>>>>>>>>>>>
      >>>>>>>>>>>>>>>>>>>
      >>

https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator

>>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Reply via email to