Hi Xintong,

In terms of code, I think it's not complicated. It's all about we need a
public discussion for the new metric name.
And we don't want to block the release for the rarely used metric.

Best,
Jark

On Fri, 14 Oct 2022 at 10:07, Xintong Song <tonysong...@gmail.com> wrote:

> @Qingsheng,
>
> I'm overall +1 to your proposal, with only one question: How complicated
> is it to come up with a metric for the internal traffic?
>
> I'm asking because, as the new feature is already out for 1.15 & 1.16, it
> would be nice if the corresponding new metrics can also be available in
> these versions.
>
> I'm not saying this should block the 1.16 release. Just trying to
> understand the efforts needed. If adding such metrics is not something that
> can be done shortly, I'd also be fine with releasing 1.16 without them.
>
> Best,
>
> Xintong
>
>
>
> On Fri, Oct 14, 2022 at 1:31 AM Qingsheng Ren <re...@apache.org> wrote:
>
>> Hi devs and users,
>>
>> It looks like we are getting an initial consensus in the discussion so I
>> started a voting thread [1] just now. Looking forward to your feedback!
>>
>> [1] https://lists.apache.org/thread/ozlf82mkm6ndx2n1vdgq532h156p4lt6
>>
>> Best,
>> Qingsheng
>>
>>
>> On Thu, Oct 13, 2022 at 10:41 PM Jing Ge <j...@ververica.com> wrote:
>>
>>> Hi Qingsheng,
>>>
>>> Thanks for the clarification. +1, I like the idea. Pointing both
>>> numXXXOut and numXXXSend to the same external data transfer metric does not
>>> really break the new SinkV2 design, since there was no requirement to
>>> monitor the internal traffic. So, I think both developer and user can live
>>> with it. It might not be the perfect solution but is indeed the currently
>>> best trade-off solution after considering the backward compatibility.  I
>>> would suggest firing a follow-up ticket after the PR to take care of the
>>> new metric for the internal traffic in the future.
>>>
>>> Best regards,
>>> Jing
>>>
>>>
>>> On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote:
>>>
>>>> Hi Jing,
>>>>
>>>> Thanks for the reply!
>>>>
>>>> Let me rephrase my proposal: we’d like to use numXXXOut registered on
>>>> SinkWriterOperator to reflect the traffic to the external system for
>>>> compatibility with old versions before 1.15, and make numXXXSend have the
>>>> same value as numXXXOut for compatibility within 1.15. That means both
>>>> numXXXOut and numXXXSend are used for external data transfers, which end
>>>> users care more about. As for the internal traffic within the sink, we
>>>> could name a new metric for it because this is a _new_ feature in the _new_
>>>> sink, and end users usually don’t pay attention to internal implementation.
>>>> The name of the new metric could be discussed later after 1.16 release.
>>>>
>>>> > but it might end up with monitoring unexpected metrics, which is
>>>> even worse for users, i.e. I didn't change anything, but something has been
>>>> broken since the last update.
>>>>
>>>> Yeah this is exactly what we are trying to fix with this proposal. I
>>>> believe users are more concerned with the output to the external system
>>>> than the internal data delivery in the sink, so I think we’ll have more
>>>> cases reporting like “I set up a panel on numRecordsOut in sink to monitor
>>>> the output of the job, but after upgrading to 1.15 this value is extremely
>>>> low and I didn’t change anything” if we stick to the current situation. I
>>>> think only a few end users care about the number of committables sending to
>>>> downstream as most of them don’t care how the sink works.
>>>>
>>>> We do need a re-design to fully distinguish the internal and external
>>>> traffic on metrics, not only in sink but in all operators as it’s quite
>>>> common for operators to make IO. This needs time to design, discuss, adjust
>>>> and vote, but considering this is blocking 1.16, maybe it’s better to
>>>> rescue the compatibility for now, and leave the huge reconstruction to
>>>> future versions (maybe 2.0).
>>>>
>>>> Best,
>>>> Qingsheng
>>>>
>>>> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <j...@ververica.com> wrote:
>>>>
>>>>> Hi Qingsheng,
>>>>>
>>>>> Just want to make sure we are on the same page. Are you suggesting
>>>>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
>>>>> the changes we did with FLINK-26126 and FLINK-26492?
>>>>>
>>>>> For the naming switch, please pay attention that the behaviour has
>>>>> been changed since we introduced SinkV2[1]. So, please be aware of
>>>>> different numbers(behaviour change) even with the same metrics name.
>>>>> Sticking with the old name with the new behaviour (very bad idea, IMHO)
>>>>> might seem like saving the effort in the first place, but it might end up
>>>>> with monitoring unexpected metrics, which is even worse for users, i.e. I
>>>>> didn't change anything, but something has been broken since
>>>>> the last update.
>>>>>
>>>>> For reverting, I am not sure how to fix the issue mentioned in
>>>>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed
>>>>> out, with SinkV2 we have two different output lines - one with the 
>>>>> external
>>>>> system and the other with the downstream operator. In this case,
>>>>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The
>>>>> "numXXXOut" metric can still be used, depending on what the user wants to
>>>>> monitor.
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Jing
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>>>>>
>>>>>
>>>>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> As a supplement, considering it could be a big reconstruction
>>>>>> redefining internal and external traffic and touching metric names in
>>>>>> almost all operators, this requires a lot of discussions and we might
>>>>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>>>>>> in front of us, as the output of sink is a metric that users care a
>>>>>> lot about.
>>>>>>
>>>>>> Thanks,
>>>>>> Qingsheng
>>>>>>
>>>>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org>
>>>>>> wrote:
>>>>>> >
>>>>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>>>>>> > metric definition distinguishing internal and external data
>>>>>> transfers.
>>>>>> > As you described, having IO in operators is quite common such as
>>>>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP
>>>>>> and
>>>>>> > an overall design.
>>>>>> >
>>>>>> > However I think it's necessary to change the metric back to
>>>>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>>>>>> > 1.16. The most important argument is for compatibility as I
>>>>>> mentioned
>>>>>> > in my previous email, otherwise all users have to modify their
>>>>>> configs
>>>>>> > of metric systems after upgrading to Flink 1.15+, and all custom
>>>>>> > connectors have to change their implementations to migrate to the
>>>>>> new
>>>>>> > metric name. I believe other ones participating and approving this
>>>>>> > proposal share the same concern about compatibility too. Also
>>>>>> > considering this issue is blocking the release of 1.16, maybe we
>>>>>> could
>>>>>> > fix this asap, and as for defining a new metric for internal data
>>>>>> > transfers we can have an in-depth discussion later. WDYT?
>>>>>> >
>>>>>> > Best,
>>>>>> > Qingsheng
>>>>>> >
>>>>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <
>>>>>> ches...@apache.org> wrote:
>>>>>> > >
>>>>>> > > Currently I think that would be a mistake.
>>>>>> > >
>>>>>> > > Ultimately what we have here is the culmination of us never
>>>>>> really considering how the numRecordsOut metric should behave for 
>>>>>> operators
>>>>>> that emit data to other operators _and_ external systems. This goes 
>>>>>> beyond
>>>>>> sinks.
>>>>>> > > This even applies to numRecordsIn, for cases where functions
>>>>>> query/write data from/to the outside, (e.g., Async IO).
>>>>>> > >
>>>>>> > > Having 2 separate metrics for that, 1 exclusively for internal
>>>>>> data transfers, and 1 exclusively for external data transfers, is the 
>>>>>> only
>>>>>> way to get a consistent metric definition in the long-run.
>>>>>> > > We can jump back-and-forth now or just commit to it.
>>>>>> > >
>>>>>> > > I don't think we can really judge this based on FLIP-33. It was
>>>>>> IIRC written before the two phase sinks were added, which heavily blurred
>>>>>> the lines of what a sink even is. Because it definitely is _not_ the last
>>>>>> operator in a chain anymore.
>>>>>> > >
>>>>>> > > What I would suggest is to stick with what we got (although I
>>>>>> despise the name numRecordsSend), and alias the numRecordsOut metric for
>>>>>> all non-TwoPhaseCommittingSink.
>>>>>> > >
>>>>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>>>>>> > >
>>>>>> > > Thanks for the details Chesnay!
>>>>>> > >
>>>>>> > > By “alias” I mean to respect the original definition made in
>>>>>> FLIP-33 for numRecordsOut, which is the number of records written to the
>>>>>> external system, and keep numRecordsSend as the same value as 
>>>>>> numRecordsOut
>>>>>> for compatibility.
>>>>>> > >
>>>>>> > > I think keeping numRecordsOut for the output to the external
>>>>>> system is more intuitive to end users because in most cases the metric of
>>>>>> data flow output is more essential. I agree with you that a new metric is
>>>>>> required, but considering compatibility and users’ intuition I prefer to
>>>>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>>>>>> metric for sink writer’s output to downstream operators. This might be
>>>>>> against consistency with metrics in other operators in Flink but maybe 
>>>>>> it’s
>>>>>> acceptable to have the sink as a special case.
>>>>>> > >
>>>>>> > > Best,
>>>>>> > > Qingsheng
>>>>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <
>>>>>> ches...@apache.org>, wrote:
>>>>>> > >
>>>>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>>>> numXXXOut
>>>>>> > >
>>>>>> > > But that's not possible. If it were that simple there would have
>>>>>> never been a need to introduce another metric in the first place.
>>>>>> > >
>>>>>> > > It's a rather fundamental issue with how the new sinks work, in
>>>>>> that they emit data to the external system (usually considered as
>>>>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>>>>>> operator (usually considered as "numRecordsOut" of tasks).
>>>>>> > > The original issue was that the numRecordsOut of the sink counted
>>>>>> both (which is completely wrong).
>>>>>> > >
>>>>>> > > A new metric was always required; otherwise you inevitably end up
>>>>>> breaking some semantic.
>>>>>> > > Adding a new metric for what the sink writes to the external
>>>>>> system is, for better or worse, more consistent with how these metrics
>>>>>> usually work in Flink.
>>>>>> > >
>>>>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>>>>>> > >
>>>>>> > > Thanks everyone for joining the discussion!
>>>>>> > >
>>>>>> > > > Do you have any idea what has happened in the process here?
>>>>>> > >
>>>>>> > > The discussion in this PR [1] shows some details and could be
>>>>>> helpful to understand the original motivation of the renaming. We do 
>>>>>> have a
>>>>>> test case for guarding metrics but unfortunaly the case was also modified
>>>>>> so the defense was broken.
>>>>>> > >
>>>>>> > > I think the reason why both the developer and the reviewer forgot
>>>>>> to trigger an discussion and gave a green pass on the change is that
>>>>>> metrics are quite “trivial” to be noticed as public APIs. As mentioned by
>>>>>> Martijn I couldn’t find a place noting that metrics are public APIs and
>>>>>> should be treated carefully while contributing and reviewing.
>>>>>> > >
>>>>>> > > IMHO three actions could be made to prevent this kind of changes
>>>>>> in the future:
>>>>>> > >
>>>>>> > > a. Add test case for metrics (which we already have in
>>>>>> SinkMetricsITCase)
>>>>>> > > b. We emphasize that any public-interface breaking changes should
>>>>>> be proposed by a FLIP or discussed in mailing list, and should be listed 
>>>>>> in
>>>>>> the release note.
>>>>>> > > c. We remind contributors and reviewers about what should be
>>>>>> considered as public API, and include metric names in it.
>>>>>> > >
>>>>>> > > For b and c these two pages [2][3] might be proper places.
>>>>>> > >
>>>>>> > > About the patch to revert this, it looks like we have a consensus
>>>>>> on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I
>>>>>> didn’t see complaints about this for now so it should be OK to save the
>>>>>> situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias 
>>>>>> of
>>>>>> numXXXOut considering there could possibly some users have already 
>>>>>> adapted
>>>>>> their system to the new naming, and have another internal metric for
>>>>>> reflecting number of outgoing committable batches (actually the
>>>>>> numRecordsIn of sink committer operator should be carrying this info
>>>>>> already).
>>>>>> > >
>>>>>> > > [1] https://github.com/apache/flink/pull/18825
>>>>>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>>>>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>>>>>> > >
>>>>>> > > Best,
>>>>>> > > Qingsheng
>>>>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <tonysong...@gmail.com>,
>>>>>> wrote:
>>>>>> > >
>>>>>> > > +1 for reverting these changes in Flink 1.16.
>>>>>> > >
>>>>>> > > For 1.15.3, can we make these metrics available via both names
>>>>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>>>>>> already migrated to 1.15 and numXXXSend. That means we still need to 
>>>>>> change
>>>>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>>>>>> internal to Flink sink.
>>>>>> > >
>>>>>> > > I'm overall +1 to change numXXXOut back to its original
>>>>>> semantics. AFAIK (from meetup / flink-forward questionaires), most users 
>>>>>> do
>>>>>> not migrate to a new Flink release immediately, until the next 1-2 major
>>>>>> releases are out.
>>>>>> > >
>>>>>> > > Best,
>>>>>> > >
>>>>>> > > Xintong
>>>>>> > >
>>>>>> > >
>>>>>> > >
>>>>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>>>>>> martijnvis...@apache.org> wrote:
>>>>>> > >>
>>>>>> > >> Hi Qingsheng,
>>>>>> > >>
>>>>>> > >> Do you have any idea what has happened in the process here? Do
>>>>>> we know why
>>>>>> > >> they were changed? I was under the impression that these metric
>>>>>> names were
>>>>>> > >> newly introduced due to the new interfaces and because it still
>>>>>> depends on
>>>>>> > >> each connector implementing these.
>>>>>> > >>
>>>>>> > >> Sidenote: metric names are not mentioned in the FLIP process as
>>>>>> a public
>>>>>> > >> API. Might make sense to have a separate follow-up to add that
>>>>>> to the list
>>>>>> > >> (I do think we should list them there).
>>>>>> > >>
>>>>>> > >> +1 for reverting this and make this change in Flink 1.16
>>>>>> > >>
>>>>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change:
>>>>>> I think the
>>>>>> > >> impact is too big for a patch version, especially given how long
>>>>>> Flink 1.15
>>>>>> > >> is already out there.
>>>>>> > >>
>>>>>> > >> Best regards,
>>>>>> > >>
>>>>>> > >> Martijn
>>>>>> > >>
>>>>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xbjt...@gmail.com>
>>>>>> wrote:
>>>>>> > >>
>>>>>> > >> > Thanks Qingsheng for starting this thread.
>>>>>> > >> >
>>>>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix
>>>>>> this
>>>>>> > >> > inconsistent behavior.
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> > Best,
>>>>>> > >> > Leonard
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> > 2022年10月10日 下午3:06,Jark Wu <imj...@gmail.com> 写道:
>>>>>> > >> >
>>>>>> > >> > Thanks for discovering this problem, Qingsheng!
>>>>>> > >> >
>>>>>> > >> > I'm also +1 for reverting the breaking changes.
>>>>>> > >> >
>>>>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the
>>>>>> new and old
>>>>>> > >> > sink is inconsistent.
>>>>>> > >> > We have to break one of them to have consistent behavior. Sink
>>>>>> V2 is an
>>>>>> > >> > evolving API which is just introduced in 1.15.
>>>>>> > >> > I think it makes sense to break the unstable API instead of
>>>>>> the stable API
>>>>>> > >> > which many connectors and users depend on.
>>>>>> > >> >
>>>>>> > >> > Best,
>>>>>> > >> > Jark
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <
>>>>>> jingsongl...@gmail.com> wrote:
>>>>>> > >> >
>>>>>> > >> >> Thanks for driving, Qingsheng.
>>>>>> > >> >>
>>>>>> > >> >> +1 for reverting sink metric name.
>>>>>> > >> >>
>>>>>> > >> >> We often forget that metric is also one of the important APIs.
>>>>>> > >> >>
>>>>>> > >> >> +1 for releasing 1.15.3 to fix this.
>>>>>> > >> >>
>>>>>> > >> >> Best,
>>>>>> > >> >> Jingsong
>>>>>> > >> >>
>>>>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <
>>>>>> becket....@gmail.com> wrote:
>>>>>> > >> >> >
>>>>>> > >> >> > Thanks for raising the discussion, Qingsheng,
>>>>>> > >> >> >
>>>>>> > >> >> > +1 on reverting the breaking changes.
>>>>>> > >> >> >
>>>>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this
>>>>>> and update
>>>>>> > >> >> the previous release docs with this known issue, so that
>>>>>> users can upgrade
>>>>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>>>>>> backwards
>>>>>> > >> >> compatibility tests on metrics to avoid unintended breaking
>>>>>> changes like
>>>>>> > >> >> this in the future.
>>>>>> > >> >> >
>>>>>> > >> >> > Thanks,
>>>>>> > >> >> >
>>>>>> > >> >> > Jiangjie (Becket) Qin
>>>>>> > >> >> >
>>>>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <
>>>>>> re...@apache.org> wrote:
>>>>>> > >> >> >>
>>>>>> > >> >> >> Hi devs and users,
>>>>>> > >> >> >>
>>>>>> > >> >> >> I’d like to start a discussion about reverting a breaking
>>>>>> change about
>>>>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492
>>>>>> [2].
>>>>>> > >> >> >>
>>>>>> > >> >> >> TL;DR
>>>>>> > >> >> >>
>>>>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>>>>> are replace
>>>>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>>>>> metric names
>>>>>> > >> >> are public APIs, this is a breaking change to end users and
>>>>>> not backward
>>>>>> > >> >> compatible. Also unfortunately this breaking change was not
>>>>>> discussed in
>>>>>> > >> >> the mailing list before.
>>>>>> > >> >> >>
>>>>>> > >> >> >> Background
>>>>>> > >> >> >>
>>>>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>>>>>> changed so
>>>>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut”
>>>>>> is used for
>>>>>> > >> >> reporting the total number of output records since the sink
>>>>>> started (number
>>>>>> > >> >> of records written to the external system), and similarly for
>>>>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”,
>>>>>> “numBytesOutPerSecond” and
>>>>>> > >> >> “numRecordsOutError”. Most sinks are following this naming
>>>>>> and definition.
>>>>>> > >> >> However, these metrics are ambiguous in the new Sink API as
>>>>>> “numXXXOut”
>>>>>> > >> >> could be used by the output of SinkWriterOperator for
>>>>>> reporting number of
>>>>>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>>>>>> resolve the
>>>>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>>>>> metrics with
>>>>>> > >> >> “numXXXSend”.
>>>>>> > >> >> >>
>>>>>> > >> >> >> Necessity of reverting this change
>>>>>> > >> >> >>
>>>>>> > >> >> >> - Metric names are actually public API, as end users need
>>>>>> to configure
>>>>>> > >> >> metric collecting and alerting system with metric names.
>>>>>> Users have to
>>>>>> > >> >> reset all configurations related to affected metrics.
>>>>>> > >> >> >> - This could also affect custom and external sinks not
>>>>>> maintained by
>>>>>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>>>>>> > >> >> >> - The number of records sent to external system is way
>>>>>> more important
>>>>>> > >> >> than the number of Committables sent to
>>>>>> SinkCommitterOperator, as the
>>>>>> > >> >> latter one is just an internal implementation of sink. We
>>>>>> could have a new
>>>>>> > >> >> metric name for the latter one instead.
>>>>>> > >> >> >> - We could avoid splitting the project by version (like
>>>>>> “plz use
>>>>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert
>>>>>> it ASAP,
>>>>>> > >> >> cosidering 1.16 is still not released for now.
>>>>>> > >> >> >>
>>>>>> > >> >> >> As a consequence, I’d like to hear from devs and users
>>>>>> about your
>>>>>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>>>>>> > >> >> >>
>>>>>> > >> >> >> Looking forward to your reply!
>>>>>> > >> >> >>
>>>>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>>>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>>>>> > >> >> >> [1] FLIP-33, version 18:
>>>>>> > >> >>
>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>>>>> > >> >> >>
>>>>>> > >> >> >> Best,
>>>>>> > >> >> >> Qingsheng
>>>>>> > >> >>
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >
>>>>>> > >
>>>>>> > >
>>>>>>
>>>>>

Reply via email to