Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Hang Ruan Sat, 28 Jan 2023 01:11:58 -0800

Hi all,

Thanks for all the comments. Sorry for my late reply.


I think I understand most of the suggestions. I still have the same
question about the adding task id/name like Jark.
And I have refactored the POC like this (
https://github.com/ruanhang1993/flink/commit/321192d17afbb9d00285a78a121676ae5e371292),
whose scope has the task name. When we initialize the coordinator, it is
hard to get the task name like we do in the TaskMetricGroup.
I also found a weird thing
that ExecutionJobVertex#getTaskInformationOrBlobKey will use
the jobVertex.getName() as the task name.
But ExecutionJobVertex#createOperatorCoordinatorHolder also will use
the jobVertex.getName() as the operatorName
for LazyInitializedCoordinatorContext. Maybe the task name will
be duplicated to the operator name in the metric name.

Best,
Hang

Jark Wu <imj...@gmail.com> 于2023年1月28日周六 16:15写道：

> Hi all,
>
> IIUC, Chesnay means we should have a more general metric group for
> **operator**
> not for **coordinator** in JM, this would be useful to extend
> other operator-specific
> metrics in the future. That means the new scope format should be designed
> for
> the operator,
> e.g., metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
> The coordinator metric is a subgroup (a constant "coordinator" suffix) of
> the JMOperatorMG.
>
> I think this is a nice design. However, I have a question about adding
> task id/name to this list.
> How to get the task id/name when reporting a coordinator metric? As we
> know, coordinator metrics
> are just like JMJobMetricGroup, which don't belong to any tasks/vertexes.
> Do you mean reporting
> coordinator metrics for every task id under the operator?
>
> Best,
> Jark
>
>
> On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ches...@apache.org> wrote:
>
>>  > First, I do not understand why users have to configure the new scope
>> format, which has a default value.
>>
>> If you don't use scope formats, sure. If you do use scope formats, e.g.
>> to add a common prefix (which is the case for datadog users for
>> example), then the current default in the FLIP is insufficient and
>> requires the user to update the configuration.
>>
>>  > I usually do not need to change these configuration of scope formats
>> when submitting the flink job.
>>
>> Scope formats as a whole are quite a power-user feature, but that
>> doesn't mean we should ignore it.
>>
>>  > I try to let it extend the ComponentMetricGroup
>>
>> This isn't inherently required just because it is a "component".
>> Component metric groups should _only_ be used for cases where such a
>> component provides several bits of metadata.
>> For example, tasks provide vertex ids, task names, attempted IDs etc.,
>> and we'd like users to have the option on how this metadata ends up
>> being used (via scope formats). This currently can't be built with the
>> addGroup() methods, hence why the component groups exist.
>> However, the operator coordinator _itself_does not provide several bits
>> of metadata. Logically, the _operator_ does, not the coordinator.
>>
>>  > This make me decide to add a new scope format.
>>
>> And this is fine; just don't add a new scope format for the coordinator
>> but logically for operators on the JM side, such that we can extend the
>> set of operator-specific metrics in the future without having to add yet
>> another scope format.
>>
>>  > The relationship between the Operator and OperatorCoordinator is
>> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>> name/id could be found and the OperatorCoordinatorMetricGroup will be
>> created here.
>>
>> That's all irrelevant.
>>
>>  > The registered metrics of OperatorCoordinator always are those which
>> can not be aggregated from the tasks, like the number of unassigned
>> splits in the SourceCoordinator. Actually we have not encountered the
>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>> should only contain the metrics for itself instead of its subtasks.
>>
>> You are misunderstanding the problem.
>>
>> This isn't about aggregating metrics on our side or exposing metrics
>> from TMs on the JM side or anything like that.
>>
>> It's purely about the available metadata for operator metrics on the JM
>> side. Currently you suggest operator name / id; what I'm proposing is to
>> add task(==vertex!) id / name to this list.
>>
>> The benefit here is simple. I can look up all metrics where task_id==XXX
>> and get _everything_ related to that vertex, including all metrics
>> associated with operators that are part of that vertex, including the
>> coordinator.
>>
>>  > Using metrics.scope.jm.job is not enough to distinguish the different
>> coordinators.
>>
>> I did not suggest just using metrics.scope.jm.job. I suggested that as
>> the default for the jm operator scope format, that is used for the
>> proposed JobManagerOperatorMetricGroup, which will have some plain
>> metric groups as children for coordinators.
>> Aka, you have the JMOperatorMG, then you call addGroup("coordinators")
>> and then addGroup(coordinator_name) for each coordinator.
>>
>>  > So there are two choice for the
>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new scope
>> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one
>> is better?
>>
>> There are more options than that.
>> I feel like there are a lot of misunderstandings in this discussion.
>> Please let me know if I could clear things up. If not I can also provide
>> a PoC based on your PoC if that can speed things up. It's not that
>> different anyway.
>>
>> On 19/01/2023 07:39, Hang Ruan wrote:
>> > Hi, chesnay,
>> >
>> > Thanks for your reply. I still have some doubts about the questions
>> > you raised.
>> >
>> >     > Extending the set of ScopeFormats is problematic because it in
>> >     practice
>> >     it breaks the config if users actively rely on it, since there's now
>> >     another key that they _must_ set for it to be
>> >     consistent/compatible with
>> >     their existing setup.
>> >     Unfortunately due to how powerful scope formats are we can't derive
>> a
>> >     default value that matches their existing setup.
>> >     Hence we should try to do this as rarely as possible.
>> >
>> >
>> >     > This FLIP does not adhere to that since it proposes a dedicated
>> >     format
>> >     for coordinators; next time we want to expose operator-specific
>> >     metrics
>> >     (e.g., in the scheduler) we'd have to add another one to support it.
>> >
>> >
>> > First, I do not understand why users have to configure the new scope
>> > format, which has a default value. I usually do not need to change
>> > these configuration of scope formats when submitting the flink job.
>> > IMO, the OperatorCoordinator is a component running at the JobMaster.
>> > I try to let it extend the ComponentMetricGroup, but I cannot find
>> > a suitable scope. This make me decide to add a new scope format.
>> >
>> >     > Additionally, the configurable variables (operator name/id) are
>> >     logically not attached to the coordinator, but operators, so to me
>> it
>> >     just doesn't make sense to structure it like this.
>> >
>> >
>> > The relationship between the Operator and OperatorCoordinator is
>> > maintained in the OperatorCoordinatorHolder. In the POC, the operator
>> > name/id could be found and the OperatorCoordinatorMetricGroup will be
>> > created here.
>> >
>> >     > Another thing I'm concerned about is that, because we don't
>> include
>> >     tasks in the hierarchy, users wishing to collect all metrics for a
>> >     particular task (in this case ==vertex) now have to go
>> >     significantly out
>> >     of their way to get them, since they can no longer just filter by
>> the
>> >     task ID but have to be filter for _all_ operators that are part of
>> >     the task.
>> >
>> >
>> > The registered metrics of OperatorCoordinator always are those which
>> > can not be aggregated from the tasks, like the number of unassigned
>> > splits in the SourceCoordinator. Actually we have not encountered the
>> > scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>> > should only contain the metrics for itself instead of its subtasks.
>> >
>> > Using metrics.scope.jm.job is not enough to distinguish the different
>> > coordinators. The operator id/name is necessary. Then
>> > the implementation should be like this. But users can not change the
>> > scope in this way. This is also acceptable.
>> > @Internal public class InternalOperatorCoordinatorMetricGroup
>> >          extends ProxyMetricGroup<MetricGroup>
>> >          implements OperatorCoordinatorMetricGroup {
>> >      public InternalOperatorCoordinatorMetricGroup(
>> >              JobManagerJobMetricGroup parent, OperatorID operatorID,
>> String operatorName) {
>> >          super(parent.addGroup(operatorID +
>> operatorName).addGroup("coordinator")); }
>> > }
>> > So there are two choice for the
>> > InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>> > scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
>> > Which one is better?
>> >
>> > Best,
>> > Hang
>> >
>> >
>> > Chesnay Schepler <ches...@apache.org> 于2023年1月18日周三 17:03写道：
>> >
>> >     You're misunderstanding the problem.
>> >
>> >     Metric groups form a tree with each group providing certain
>> metadata.
>> >
>> >     E.g., on the taskmanager we have a TM metric group that provides
>> info
>> >     about the TM, that has child task metric groups, that have child
>> >     operator metric groups etc. The operator metric group again has
>> >     children, where we are now mostly in the user/connector land, like a
>> >     kafka metric group providing partition info or something.
>> >
>> >     What is being proposed is to create a coordinator metric group on
>> >     the JM
>> >     side that provides operator metadata.
>> >
>> >     A more appropriate structure is to create an operator group on the
>> JM
>> >     side that provides this info, with the coordinator metric group
>> >     being a
>> >     child.
>> >
>> >     On 17/01/2023 19:56, Steven Wu wrote:
>> >     >> Additionally, the configurable variables (operator name/id) are
>> >     > logically not attached to the coordinator, but operators, so to
>> >     me it
>> >     > just doesn't make sense to structure it like this.
>> >     >
>> >     > Chesnay, maybe we should clarify the terminology. To me,
>> >     pperators (like
>> >     > FLIP-27 source) can have two parts (coordinator and
>> >     reader/subtask). I
>> >     > think it is fine to include operator name/id for coordinator
>> >     metrics.
>> >     >
>> >     > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
>> >     <ches...@apache.org> wrote:
>> >     >
>> >     >> Slight correction: Using metrics.scope.jm.job as the default
>> >     should be
>> >     >> safe.
>> >     >>
>> >     >> On 16/01/2023 10:18, Chesnay Schepler wrote:
>> >     >>> The proposed ScopeFormat is still problematic for a few reasons.
>> >     >>>
>> >     >>> Extending the set of ScopeFormats is problematic because it in
>> >     >>> practice it breaks the config if users actively rely on it,
>> since
>> >     >>> there's now another key that they _must_ set for it to be
>> >     >>> consistent/compatible with their existing setup.
>> >     >>> Unfortunately due to how powerful scope formats are we can't
>> >     derive a
>> >     >>> default value that matches their existing setup.
>> >     >>> Hence we should try to do this as rarely as possible.
>> >     >>>
>> >     >>> This FLIP does not adhere to that since it proposes a
>> >     dedicated format
>> >     >>> for coordinators; next time we want to expose operator-specific
>> >     >>> metrics (e.g., in the scheduler) we'd have to add another one to
>> >     >>> support it.
>> >     >>>
>> >     >>> Additionally, the configurable variables (operator name/id) are
>> >     >>> logically not attached to the coordinator, but operators, so
>> >     to me it
>> >     >>> just doesn't make sense to structure it like this.
>> >     >>>
>> >     >>> Another thing I'm concerned about is that, because we don't
>> >     include
>> >     >>> tasks in the hierarchy, users wishing to collect all metrics
>> for a
>> >     >>> particular task (in this case ==vertex) now have to go
>> >     significantly
>> >     >>> out of their way to get them, since they can no longer just
>> >     filter by
>> >     >>> the task ID but have to be filter for _all_ operators that are
>> >     part of
>> >     >>> the task.
>> >     >>>
>> >     >>> On 16/01/2023 03:09, Hang Ruan wrote:
>> >     >>>> Hi, @ches...@apache.org <ches...@apache.org> ,
>> >     >>>>
>> >     >>>> Do you have time to help to review this FLIP again after the
>> >     >>>> modification?
>> >     >>>> Looking forward to your reply.
>> >     >>>> This FLIP will add a new configuration for the
>> >     >>>> OperatorCoordinatorMetricGroup scope format. It provides an
>> >     internal
>> >     >>>> implementation and is added as a component to the
>> >     >>>> JobManagerJobMetricGroup.
>> >     >>>> If something doesn't make sense, could you provide some
>> >     advice? It
>> >     >>>> will be
>> >     >>>> very helpful. Thanks a lot for your help.
>> >     >>>>
>> >     >>>> Best,
>> >     >>>> Hang
>> >     >>>>
>> >     >>>> Martijn Visser <martijnvis...@apache.org> 于2023年1月11日周三
>> >     16:34写道：
>> >     >>>>
>> >     >>>>
>> >     >>>>> Hi Hang,
>> >     >>>>>
>> >     >>>>> I'm a bit surprised that this has gone to a vote, given that
>> >     Chesnay
>> >     >>>>> deliberately mentioned that he would vote against it as-is.
>> >     I would
>> >     >>>>> expect
>> >     >>>>> that before going to a vote, he has had the opportunity to
>> >     >>>>> participate in
>> >     >>>>> this discussion.
>> >     >>>>>
>> >     >>>>> Best regards,
>> >     >>>>>
>> >     >>>>> Martijn
>> >     >>>>>
>> >     >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <imj...@gmail.com>
>> >     wrote:
>> >     >>>>>
>> >     >>>>>> Hi Dong,
>> >     >>>>>>
>> >     >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
>> >     concern is
>> >     >>>>>> that
>> >     >>>>>> this is a core interface for the FLIP. It’s hard to tell how
>> >     >>>>>> sources use
>> >     >>>>>> metric group without mentioning this interface. Even if
>> >     this is an
>> >     >>>>> existing
>> >     >>>>>> API, I think it’s worth introducing the interface again and
>> >     declaring
>> >     >>>>> that
>> >     >>>>>> we will implement the interface instead of a no-op method
>> >     in this
>> >     >>>>>> FLIP.
>> >     >>>>>>
>> >     >>>>>> Anyway, this is a minor problem and shouldn’t block this
>> >     FLIP. I’m
>> >     >>>>>> +1 to
>> >     >>>>>> start a vote.
>> >     >>>>>>
>> >     >>>>>> Best,
>> >     >>>>>> Jark
>> >     >>>>>>
>> >     >>>>>>
>> >     >>>>>>> 2023年1月3日 10:03，Hang Ruan <ruanhang1...@gmail.com> 写道：
>> >     >>>>>>>
>> >     >>>>>>> Hi, Jark and Dong,
>> >     >>>>>>>
>> >     >>>>>>> Thanks for your comments. Sorry for my late reply.
>> >     >>>>>>>
>> >     >>>>>>> For suggestion 1, I plan to implement the
>> >     >>>>>>> SplitEnumeratorMetricGroup in
>> >     >>>>>>> another issue, and it is not contained in this FLIP. I
>> >     will add some
>> >     >>>>>>> description about this part.
>> >     >>>>>>> For suggestion 2, changes about
>> >     OperatorCoordinator#metricGroup has
>> >     >>>>>> already
>> >     >>>>>>> been documented in the proposed change section.
>> >     >>>>>>>
>> >     >>>>>>> Best,
>> >     >>>>>>> Hang
>> >     >>>>>>>
>> >     >>>>>>> Dong Lin <lindon...@gmail.com> 于2023年1月1日周日 09:45写道：
>> >     >>>>>>>
>> >     >>>>>>>> Let me chime-in and add comments regarding the public
>> >     interface
>> >     >>>>> section.
>> >     >>>>>>>> Please see my comments inline.
>> >     >>>>>>>>
>> >     >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
>> >     <imj...@gmail.com> wrote:
>> >     >>>>>>>>
>> >     >>>>>>>>> Hi Hang,
>> >     >>>>>>>>>
>> >     >>>>>>>>> Thanks for driving this discussion. I think this is a
>> >     very useful
>> >     >>>>>> feature
>> >     >>>>>>>>> for connectors.
>> >     >>>>>>>>>
>> >     >>>>>>>>> The FLIP looks quite good to me, and I just have two
>> >     suggestions.
>> >     >>>>>>>>>
>> >     >>>>>>>>> 1. In the "Public Interface" section, mention that the
>> >     >>>>>>>>> implementation
>> >     >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
>> >     changed from
>> >     >>>>>>>> returning
>> >     >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup
>> >     instance.
>> >     >>>>> Even
>> >     >>>>>>>>> though the API is already there, the behavior change can
>> >     also be
>> >     >>>>>>>> considered
>> >     >>>>>>>>> a public change.
>> >     >>>>>>>>>
>> >     >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
>> >     this FLIP
>> >     >>>>>>>> does
>> >     >>>>>> not
>> >     >>>>>>>> seem to change its semantics/behavior. The FLIP does
>> >     change the
>> >     >>>>>>>> implementation/behavior of
>> >     SourceCoordinatorContext#metricGroup,
>> >     >>>>>>>> which
>> >     >>>>>> is
>> >     >>>>>>>> marked @Internal.
>> >     >>>>>>>>
>> >     >>>>>>>> Thus it might seem a bit weird to add in the public
>> interface
>> >     >>>>>>>> section
>> >     >>>>>>>> saying that we change the interface
>> >     >>>>>>>> SplitEnumeratorContext#metricGroup
>> >     >>>>>> from
>> >     >>>>>>>> returning null to non-null object.
>> >     >>>>>>>>
>> >     >>>>>>>> 2. Mention the newly added interface of
>> >     >>>>>> "OperatorCoordinator#metricGroup"
>> >     >>>>>>>>> in the "Proposed Changes" section or "Public Interface"
>> >     section. As
>> >     >>>>> the
>> >     >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
>> >     connectors.
>> >     >>>>>> Though
>> >     >>>>>>>> it
>> >     >>>>>>>>> is still an @Internal API, I think it is worth
>> >     mentioning the
>> >     >>>>>>>>> change
>> >     >>>>> in
>> >     >>>>>>>> the
>> >     >>>>>>>>> FLIP.
>> >     >>>>>>>> Since OperatorCoordinator is an internal API, it seems
>> >     reasonable to
>> >     >>>>>>>> explain it in the proposed change section. The FLIP seems
>> >     to have
>> >     >>>>>>>> documented this in point 5 of the proposed change section.
>> >     >>>>>>>>
>> >     >>>>>>>> BTW, if we think there are @internal classes that are
>> >     important
>> >     >>>>>>>> enough
>> >     >>>>>> to
>> >     >>>>>>>> be added in the public interface section, it might be
>> >     useful to
>> >     >>>>>> explicitly
>> >     >>>>>>>> discuss this topic and document it in the "*What are the
>> >     "public
>> >     >>>>>>>> interfaces" of the project*" in this
>> >     >>>>>>>> <
>> >     >>>>>>>>
>> >     >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>> >     >>>>>>>> wiki.
>> >     >>>>>>>>
>> >     >>>>>>>>
>> >     >>>>>>>>> Best,
>> >     >>>>>>>>> Jark
>> >     >>>>>>>>>
>> >     >>>>>>>>>
>> >     >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
>> >     <ruanhang1...@gmail.com>
>> >     >>>>>> wrote:
>> >     >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>> >     >>>>>>>>>> After combining everyone's comments, the main concerns
>> and
>> >     >>>>>>>> corresponding
>> >     >>>>>>>>>> adjustments are as follows.
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> Q1: Common metrics are not quite useful.
>> >     >>>>>>>>>> numEventsIn and numEventsOut counters will be removed
>> >     from the
>> >     >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do
>> not
>> >     >>>>>>>>>> provide
>> >     >>>>>>>>> enough
>> >     >>>>>>>>>> information for users. The users are more willing to
>> >     get the
>> >     >>>>>>>>>> number
>> >     >>>>> of
>> >     >>>>>>>>>> events of the specified type instead of the total
>> >     number. And this
>> >     >>>>>>>> metric
>> >     >>>>>>>>>> is calculated differently. The implementation could
>> >     register the
>> >     >>>>>> metric
>> >     >>>>>>>>> by
>> >     >>>>>>>>>> themselves.
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> Q2: This FLIP is overly complicated.
>> >     >>>>>>>>>> This FLIP will become concise after these modifications.
>> >     >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
>> >     introduced into
>> >     >>>>> Flink
>> >     >>>>>>>> by
>> >     >>>>>>>>>> FLIP-179<
>> >     >>>>>>>>>>
>> >     >>>>>>>>>>
>> >     >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>> >     >>>>>>>>>>> .
>> >     >>>>>>>>>> And
>> >     >>>>>>>>>> this FLIP will not change it. This FLIP only provides a
>> >     new metric
>> >     >>>>>>>> option
>> >     >>>>>>>>>> and a new metric group scope. The changes in proposed
>> >     changes
>> >     >>>>> provide
>> >     >>>>>>>> the
>> >     >>>>>>>>>> details about the modifications for the internal
>> >     classes, which
>> >     >>>>> might
>> >     >>>>>>>>> make
>> >     >>>>>>>>>> it look complicated.
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> Thanks for all the comments again. If there are no
>> further
>> >     >>>>>>>>>> comments,
>> >     >>>>>> we
>> >     >>>>>>>>>> plan to start the voting thread this week.
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> Best,
>> >     >>>>>>>>>> Hang
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> Qingsheng Ren <renqs...@gmail.com> 于2022年12月26日周一
>> >     16:48写道：
>> >     >>>>>>>>>>
>> >     >>>>>>>>>>
>> >     >>>>>>>>>>> Thanks for the FLIP, Hang!
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> This FLIP overall looks good to me. Actually I share
>> >     the same
>> >     >>>>> concern
>> >     >>>>>>>>>> with
>> >     >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not
>> >     quite
>> >     >>>>>>>>>>> useful
>> >     >>>>>>>> to
>> >     >>>>>>>>>> end
>> >     >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction,
>> >     which
>> >     >>>>> requires
>> >     >>>>>>>>>>> instantialization in order to be practical to users and
>> >     >>>>>>>>>>> developers,
>> >     >>>>>>>> so
>> >     >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> Best,
>> >     >>>>>>>>>>> Qingsheng
>> >     >>>>>>>>>>> Ververica (Alibaba)
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
>> >     <reed...@gmail.com>
>> >     >>>>> wrote:
>> >     >>>>>>>>>>>> Hi Hang,
>> >     >>>>>>>>>>>> I still see no strong reason why we need
>> >     >>>>>>>>>>>> numEventsIn/numEventsOut
>> >     >>>>>>>>>>> metrics.
>> >     >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same
>> >     concern
>> >     >>>>>>>>>>>> from
>> >     >>>>>>>>>> others.
>> >     >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
>> >     >>>>>>>>>>>> over-extending
>> >     >>>>>>>>> the
>> >     >>>>>>>>>>>> scope.
>> >     >>>>>>>>>>>>
>> >     >>>>>>>>>>>> Thanks,
>> >     >>>>>>>>>>>> Zhu
>> >     >>>>>>>>>>>>
>> >     >>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月23日周五
>> >     >>>>>>>>>>>> 15:21写道：
>> >     >>>>>>>>>>>>> Hi everyone,
>> >     >>>>>>>>>>>>>
>> >     >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep
>> >     the common
>> >     >>>>>>>>>>> metrics,
>> >     >>>>>>>>>>>> which will help to observe incoming and outgoing
>> >     events. What do
>> >     >>>>>>>> you
>> >     >>>>>>>>>>> think,
>> >     >>>>>>>>>>>> @Zhu Zhu ?
>> >     >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
>> >     more concerned
>> >     >>>>>>>>>> about?
>> >     >>>>>>>>>>>> Looking forward to your reply.
>> >     >>>>>>>>>>>>> Thanks for all the comments. If there are no further
>> >     >>>>>>>>>>>>> comments, we
>> >     >>>>>>>>>> plan
>> >     >>>>>>>>>>>> to start the voting thread next week.
>> >     >>>>>>>>>>>>> Best,
>> >     >>>>>>>>>>>>> Hang
>> >     >>>>>>>>>>>>>
>> >     >>>>>>>>>>>>> Hang Ruan <ruanhang1...@gmail.com> 于2022年12月15日周四
>> >     >>>>>>>>>>>>> 16:49写道：
>> >     >>>>>>>>>>>>>> Hi, Zhu Zhu,
>> >     >>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>> Thanks for your feedback!
>> >     >>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>> The OperatorCoordinator implementations are
>> >     different. And
>> >     >>>>>>>>>>>>>> their
>> >     >>>>>>>>>>>> metrics are much different too. We try to find the
>> common
>> >     >>>>>>>>>>>> metrics
>> >     >>>>>>>> and
>> >     >>>>>>>>>> put
>> >     >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
>> >     developers
>> >     >>>>>>>> think
>> >     >>>>>>>>> we
>> >     >>>>>>>>>>> do
>> >     >>>>>>>>>>>> not need these common metrics, removing the common
>> >     metrics is
>> >     >>>>>>>>>>>> also
>> >     >>>>>>>>>>>> acceptable.
>> >     >>>>>>>>>>>>>> Best,
>> >     >>>>>>>>>>>>>> Hang
>> >     >>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>> Zhu Zhu <reed...@gmail.com> 于2022年12月14日周三
>> >     22:09写道：
>> >     >>>>>>>>>>>>>>> Hi Hang & MengYue,
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> Thanks for creating this FLIP!
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
>> >     >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics.
>> >     >>>>>>>>>>>>>>> Currently
>> >     >>>>>>>>>>>>>>> the coordinators has no way to do this. And operator
>> >     >>>>>>>> coordinator
>> >     >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
>> >     to have
>> >     >>>>>>>>>>>>>>> access
>> >     >>>>>>>>>>>>>>> to a registered metric group (via the existing
>> public
>> >     >>>>>>>>>>>>>>> interface
>> >     >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
>> >     null at the
>> >     >>>>>>>>> moment.
>> >     >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
>> >     metrics. A clear
>> >     >>>>>>>>>>>> definition
>> >     >>>>>>>>>>>>>>> makes it easy for users to find their wanted
>> >     metrics. The
>> >     >>>>>>>>>> definition
>> >     >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
>> multiple
>> >     >>>>>>>>>>>> OperatorCoordinators
>> >     >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may
>> >     have its
>> >     >>>>>>>>>>>>>>> own
>> >     >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
>> >     be directly
>> >     >>>>>>>>>>> registered
>> >     >>>>>>>>>>>>>>> to the job metric group.
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
>> >     introduced
>> >     >>>>>>>>> common
>> >     >>>>>>>>>>>> metrics
>> >     >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there
>> >     any case
>> >     >>>>>>>> which
>> >     >>>>>>>>>>>> strongly
>> >     >>>>>>>>>>>>>>> requires them?
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>> >     >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
>> overkill
>> >     >>>>>>>>>>>>>>> Directly using the job metric group can result in
>> >     metric
>> >     >>>>>>>>> conflicts,
>> >     >>>>>>>>>>> as
>> >     >>>>>>>>>>>> mentioned
>> >     >>>>>>>>>>>>>>> in above #2.
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> Thanks,
>> >     >>>>>>>>>>>>>>> Zhu
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> Dong Lin <lindon...@gmail.com> 于2022年12月10日周六
>> >     >>>>>>>>>>>>>>> 14:16写道：
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>> Hi Chesney,
>> >     >>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>> Just to double check with you,
>> >     >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>> >     >>>>>>>>>>>> (annotated as
>> >     >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into
>> >     Flink by
>> >     >>>>>>>>>> FLIP-179
>> >     >>>>>>>>>>>>>>>> <
>> >     >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>> >     >>>>>>>>>>>>> .
>> >     >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
>> >     should remove
>> >     >>>>>>>>> this
>> >     >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>> >     >>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>> Regards,
>> >     >>>>>>>>>>>>>>>> Dong
>> >     >>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
>> >     >>>>>>>>>>> ches...@apache.org>
>> >     >>>>>>>>>>>> wrote:
>> >     >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
>> >     complicated. A
>> >     >>>>>>>>>>> dedicated
>> >     >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
>> >     could just
>> >     >>>>>>>>> re-use
>> >     >>>>>>>>>>> the
>> >     >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
>> >     structure we
>> >     >>>>>>>>> have
>> >     >>>>>>>>>> on
>> >     >>>>>>>>>>>> TMs,
>> >     >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>> >     >>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required
>> >     anyway,
>> >     >>>>>>>>>> because
>> >     >>>>>>>>>>>> all the
>> >     >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on
>> >     the TM
>> >     >>>>>>>> side
>> >     >>>>>>>>> +
>> >     >>>>>>>>>>>>>>>>> aggregating them in the external metrics backend.
>> >     >>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries
>> >     to pull a
>> >     >>>>>>>>> fast
>> >     >>>>>>>>>>>> one on
>> >     >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
>> >     against it.
>> >     >>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>> >     >>>>>>>>>>>>>>>>>> Hi Hang,
>> >     >>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it
>> >     is pretty
>> >     >>>>>>>>>>>> informative.
>> >     >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
>> >     >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
>> >     >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
>> >     >>>>>>>> consistency
>> >     >>>>>>>>>>> with
>> >     >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>> >     >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>> >     >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>> >     >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>> >     >>>>>>>>>>>> for
>> >     >>>>>>>>>>>>>>>>>> simplicity and consistency with
>> >     SCOPE_NAMING_OPERATOR
>> >     >>>>>>>>> (which
>> >     >>>>>>>>>> is
>> >     >>>>>>>>>>>> not named
>> >     >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>> >     >>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>> Cheers,
>> >     >>>>>>>>>>>>>>>>>> Dong
>> >     >>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>> >     >>>>>>>>>>> ruanhang1...@gmail.com>
>> >     >>>>>>>>>>>> wrote:
>> >     >>>>>>>>>>>>>>>>>>> Hi all,
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
>> >     metric group
>> >     >>>>>>>>> for
>> >     >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
>> >     >>>>>>>>> coordinator
>> >     >>>>>>>>>>> for
>> >     >>>>>>>>>>>> runtime
>> >     >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
>> >     coordination
>> >     >>>>>>>>>>>> mechanism is
>> >     >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
>> >     and its all
>> >     >>>>>>>>>>>> operators, the
>> >     >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink,
>> for
>> >     >>>>>>>> example
>> >     >>>>>>>>>> many
>> >     >>>>>>>>>>>> Sources
>> >     >>>>>>>>>>>>>>>>> and
>> >     >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits
>> and
>> >     >>>>>>>>>> coordinate
>> >     >>>>>>>>>>>> commits to
>> >     >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
>> >     widely
>> >     >>>>>>>> using
>> >     >>>>>>>>> in
>> >     >>>>>>>>>>>> flink kafka
>> >     >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
>> >     connector,
>> >     >>>>>>>>>> flink
>> >     >>>>>>>>>>>> hudi
>> >     >>>>>>>>>>>>>>>>>>> connector and so on.
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope
>> >     for the
>> >     >>>>>>>>>>>>>>>>> OperatorCoordinator
>> >     >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
>> >     >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>> >     >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be
>> >     how many
>> >     >>>>>>>>>>>> splits/partitions
>> >     >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many
>> >     files
>> >     >>>>>>>> have
>> >     >>>>>>>>>> been
>> >     >>>>>>>>>>>> written
>> >     >>>>>>>>>>>>>>>>> out
>> >     >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
>> >     users to
>> >     >>>>>>>> know
>> >     >>>>>>>>>> the
>> >     >>>>>>>>>>>> job
>> >     >>>>>>>>>>>>>>>>> progress
>> >     >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus
>> we
>> >     >>>>>>>> propose
>> >     >>>>>>>>>> the
>> >     >>>>>>>>>>>> FLIP-274
>> >     >>>>>>>>>>>>>>>>> to
>> >     >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>> >     >>>>>>>> OperatorCoordinator
>> >     >>>>>>>>>> and
>> >     >>>>>>>>>>>> provide
>> >     >>>>>>>>>>>>>>>>> an
>> >     >>>>>>>>>>>>>>>>>>> internal implementation for
>> >     >>>>>>>>> OperatorCoordinatorMetricGroup.
>> >     >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
>> >     time? Any
>> >     >>>>>>>>>>> feedback
>> >     >>>>>>>>>>>> is
>> >     >>>>>>>>>>>>>>>>>>> appreciated!
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>> Best,
>> >     >>>>>>>>>>>>>>>>>>> Hang
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>> [1]
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>> >     >>
>> >
>>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Reply via email to