Hi Xintong, In terms of code, I think it's not complicated. It's all about we need a public discussion for the new metric name. And we don't want to block the release for the rarely used metric.
Best, Jark On Fri, 14 Oct 2022 at 10:07, Xintong Song <tonysong...@gmail.com> wrote: > @Qingsheng, > > I'm overall +1 to your proposal, with only one question: How complicated > is it to come up with a metric for the internal traffic? > > I'm asking because, as the new feature is already out for 1.15 & 1.16, it > would be nice if the corresponding new metrics can also be available in > these versions. > > I'm not saying this should block the 1.16 release. Just trying to > understand the efforts needed. If adding such metrics is not something that > can be done shortly, I'd also be fine with releasing 1.16 without them. > > Best, > > Xintong > > > > On Fri, Oct 14, 2022 at 1:31 AM Qingsheng Ren <re...@apache.org> wrote: > >> Hi devs and users, >> >> It looks like we are getting an initial consensus in the discussion so I >> started a voting thread [1] just now. Looking forward to your feedback! >> >> [1] https://lists.apache.org/thread/ozlf82mkm6ndx2n1vdgq532h156p4lt6 >> >> Best, >> Qingsheng >> >> >> On Thu, Oct 13, 2022 at 10:41 PM Jing Ge <j...@ververica.com> wrote: >> >>> Hi Qingsheng, >>> >>> Thanks for the clarification. +1, I like the idea. Pointing both >>> numXXXOut and numXXXSend to the same external data transfer metric does not >>> really break the new SinkV2 design, since there was no requirement to >>> monitor the internal traffic. So, I think both developer and user can live >>> with it. It might not be the perfect solution but is indeed the currently >>> best trade-off solution after considering the backward compatibility. I >>> would suggest firing a follow-up ticket after the PR to take care of the >>> new metric for the internal traffic in the future. >>> >>> Best regards, >>> Jing >>> >>> >>> On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote: >>> >>>> Hi Jing, >>>> >>>> Thanks for the reply! >>>> >>>> Let me rephrase my proposal: we’d like to use numXXXOut registered on >>>> SinkWriterOperator to reflect the traffic to the external system for >>>> compatibility with old versions before 1.15, and make numXXXSend have the >>>> same value as numXXXOut for compatibility within 1.15. That means both >>>> numXXXOut and numXXXSend are used for external data transfers, which end >>>> users care more about. As for the internal traffic within the sink, we >>>> could name a new metric for it because this is a _new_ feature in the _new_ >>>> sink, and end users usually don’t pay attention to internal implementation. >>>> The name of the new metric could be discussed later after 1.16 release. >>>> >>>> > but it might end up with monitoring unexpected metrics, which is >>>> even worse for users, i.e. I didn't change anything, but something has been >>>> broken since the last update. >>>> >>>> Yeah this is exactly what we are trying to fix with this proposal. I >>>> believe users are more concerned with the output to the external system >>>> than the internal data delivery in the sink, so I think we’ll have more >>>> cases reporting like “I set up a panel on numRecordsOut in sink to monitor >>>> the output of the job, but after upgrading to 1.15 this value is extremely >>>> low and I didn’t change anything” if we stick to the current situation. I >>>> think only a few end users care about the number of committables sending to >>>> downstream as most of them don’t care how the sink works. >>>> >>>> We do need a re-design to fully distinguish the internal and external >>>> traffic on metrics, not only in sink but in all operators as it’s quite >>>> common for operators to make IO. This needs time to design, discuss, adjust >>>> and vote, but considering this is blocking 1.16, maybe it’s better to >>>> rescue the compatibility for now, and leave the huge reconstruction to >>>> future versions (maybe 2.0). >>>> >>>> Best, >>>> Qingsheng >>>> >>>> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <j...@ververica.com> wrote: >>>> >>>>> Hi Qingsheng, >>>>> >>>>> Just want to make sure we are on the same page. Are you suggesting >>>>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all >>>>> the changes we did with FLINK-26126 and FLINK-26492? >>>>> >>>>> For the naming switch, please pay attention that the behaviour has >>>>> been changed since we introduced SinkV2[1]. So, please be aware of >>>>> different numbers(behaviour change) even with the same metrics name. >>>>> Sticking with the old name with the new behaviour (very bad idea, IMHO) >>>>> might seem like saving the effort in the first place, but it might end up >>>>> with monitoring unexpected metrics, which is even worse for users, i.e. I >>>>> didn't change anything, but something has been broken since >>>>> the last update. >>>>> >>>>> For reverting, I am not sure how to fix the issue mentioned in >>>>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed >>>>> out, with SinkV2 we have two different output lines - one with the >>>>> external >>>>> system and the other with the downstream operator. In this case, >>>>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The >>>>> "numXXXOut" metric can still be used, depending on what the user wants to >>>>> monitor. >>>>> >>>>> >>>>> Best regards, >>>>> Jing >>>>> >>>>> [1] >>>>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48 >>>>> >>>>> >>>>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org> >>>>> wrote: >>>>> >>>>>> As a supplement, considering it could be a big reconstruction >>>>>> redefining internal and external traffic and touching metric names in >>>>>> almost all operators, this requires a lot of discussions and we might >>>>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker >>>>>> in front of us, as the output of sink is a metric that users care a >>>>>> lot about. >>>>>> >>>>>> Thanks, >>>>>> Qingsheng >>>>>> >>>>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> >>>>>> wrote: >>>>>> > >>>>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer >>>>>> > metric definition distinguishing internal and external data >>>>>> transfers. >>>>>> > As you described, having IO in operators is quite common such as >>>>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP >>>>>> and >>>>>> > an overall design. >>>>>> > >>>>>> > However I think it's necessary to change the metric back to >>>>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and >>>>>> > 1.16. The most important argument is for compatibility as I >>>>>> mentioned >>>>>> > in my previous email, otherwise all users have to modify their >>>>>> configs >>>>>> > of metric systems after upgrading to Flink 1.15+, and all custom >>>>>> > connectors have to change their implementations to migrate to the >>>>>> new >>>>>> > metric name. I believe other ones participating and approving this >>>>>> > proposal share the same concern about compatibility too. Also >>>>>> > considering this issue is blocking the release of 1.16, maybe we >>>>>> could >>>>>> > fix this asap, and as for defining a new metric for internal data >>>>>> > transfers we can have an in-depth discussion later. WDYT? >>>>>> > >>>>>> > Best, >>>>>> > Qingsheng >>>>>> > >>>>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler < >>>>>> ches...@apache.org> wrote: >>>>>> > > >>>>>> > > Currently I think that would be a mistake. >>>>>> > > >>>>>> > > Ultimately what we have here is the culmination of us never >>>>>> really considering how the numRecordsOut metric should behave for >>>>>> operators >>>>>> that emit data to other operators _and_ external systems. This goes >>>>>> beyond >>>>>> sinks. >>>>>> > > This even applies to numRecordsIn, for cases where functions >>>>>> query/write data from/to the outside, (e.g., Async IO). >>>>>> > > >>>>>> > > Having 2 separate metrics for that, 1 exclusively for internal >>>>>> data transfers, and 1 exclusively for external data transfers, is the >>>>>> only >>>>>> way to get a consistent metric definition in the long-run. >>>>>> > > We can jump back-and-forth now or just commit to it. >>>>>> > > >>>>>> > > I don't think we can really judge this based on FLIP-33. It was >>>>>> IIRC written before the two phase sinks were added, which heavily blurred >>>>>> the lines of what a sink even is. Because it definitely is _not_ the last >>>>>> operator in a chain anymore. >>>>>> > > >>>>>> > > What I would suggest is to stick with what we got (although I >>>>>> despise the name numRecordsSend), and alias the numRecordsOut metric for >>>>>> all non-TwoPhaseCommittingSink. >>>>>> > > >>>>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote: >>>>>> > > >>>>>> > > Thanks for the details Chesnay! >>>>>> > > >>>>>> > > By “alias” I mean to respect the original definition made in >>>>>> FLIP-33 for numRecordsOut, which is the number of records written to the >>>>>> external system, and keep numRecordsSend as the same value as >>>>>> numRecordsOut >>>>>> for compatibility. >>>>>> > > >>>>>> > > I think keeping numRecordsOut for the output to the external >>>>>> system is more intuitive to end users because in most cases the metric of >>>>>> data flow output is more essential. I agree with you that a new metric is >>>>>> required, but considering compatibility and users’ intuition I prefer to >>>>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new >>>>>> metric for sink writer’s output to downstream operators. This might be >>>>>> against consistency with metrics in other operators in Flink but maybe >>>>>> it’s >>>>>> acceptable to have the sink as a special case. >>>>>> > > >>>>>> > > Best, >>>>>> > > Qingsheng >>>>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler < >>>>>> ches...@apache.org>, wrote: >>>>>> > > >>>>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of >>>>>> numXXXOut >>>>>> > > >>>>>> > > But that's not possible. If it were that simple there would have >>>>>> never been a need to introduce another metric in the first place. >>>>>> > > >>>>>> > > It's a rather fundamental issue with how the new sinks work, in >>>>>> that they emit data to the external system (usually considered as >>>>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream >>>>>> operator (usually considered as "numRecordsOut" of tasks). >>>>>> > > The original issue was that the numRecordsOut of the sink counted >>>>>> both (which is completely wrong). >>>>>> > > >>>>>> > > A new metric was always required; otherwise you inevitably end up >>>>>> breaking some semantic. >>>>>> > > Adding a new metric for what the sink writes to the external >>>>>> system is, for better or worse, more consistent with how these metrics >>>>>> usually work in Flink. >>>>>> > > >>>>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote: >>>>>> > > >>>>>> > > Thanks everyone for joining the discussion! >>>>>> > > >>>>>> > > > Do you have any idea what has happened in the process here? >>>>>> > > >>>>>> > > The discussion in this PR [1] shows some details and could be >>>>>> helpful to understand the original motivation of the renaming. We do >>>>>> have a >>>>>> test case for guarding metrics but unfortunaly the case was also modified >>>>>> so the defense was broken. >>>>>> > > >>>>>> > > I think the reason why both the developer and the reviewer forgot >>>>>> to trigger an discussion and gave a green pass on the change is that >>>>>> metrics are quite “trivial” to be noticed as public APIs. As mentioned by >>>>>> Martijn I couldn’t find a place noting that metrics are public APIs and >>>>>> should be treated carefully while contributing and reviewing. >>>>>> > > >>>>>> > > IMHO three actions could be made to prevent this kind of changes >>>>>> in the future: >>>>>> > > >>>>>> > > a. Add test case for metrics (which we already have in >>>>>> SinkMetricsITCase) >>>>>> > > b. We emphasize that any public-interface breaking changes should >>>>>> be proposed by a FLIP or discussed in mailing list, and should be listed >>>>>> in >>>>>> the release note. >>>>>> > > c. We remind contributors and reviewers about what should be >>>>>> considered as public API, and include metric names in it. >>>>>> > > >>>>>> > > For b and c these two pages [2][3] might be proper places. >>>>>> > > >>>>>> > > About the patch to revert this, it looks like we have a consensus >>>>>> on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I >>>>>> didn’t see complaints about this for now so it should be OK to save the >>>>>> situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias >>>>>> of >>>>>> numXXXOut considering there could possibly some users have already >>>>>> adapted >>>>>> their system to the new naming, and have another internal metric for >>>>>> reflecting number of outgoing committable batches (actually the >>>>>> numRecordsIn of sink committer operator should be carrying this info >>>>>> already). >>>>>> > > >>>>>> > > [1] https://github.com/apache/flink/pull/18825 >>>>>> > > [2] https://flink.apache.org/contributing/contribute-code.html >>>>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html >>>>>> > > >>>>>> > > Best, >>>>>> > > Qingsheng >>>>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <tonysong...@gmail.com>, >>>>>> wrote: >>>>>> > > >>>>>> > > +1 for reverting these changes in Flink 1.16. >>>>>> > > >>>>>> > > For 1.15.3, can we make these metrics available via both names >>>>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who >>>>>> already migrated to 1.15 and numXXXSend. That means we still need to >>>>>> change >>>>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is >>>>>> internal to Flink sink. >>>>>> > > >>>>>> > > I'm overall +1 to change numXXXOut back to its original >>>>>> semantics. AFAIK (from meetup / flink-forward questionaires), most users >>>>>> do >>>>>> not migrate to a new Flink release immediately, until the next 1-2 major >>>>>> releases are out. >>>>>> > > >>>>>> > > Best, >>>>>> > > >>>>>> > > Xintong >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser < >>>>>> martijnvis...@apache.org> wrote: >>>>>> > >> >>>>>> > >> Hi Qingsheng, >>>>>> > >> >>>>>> > >> Do you have any idea what has happened in the process here? Do >>>>>> we know why >>>>>> > >> they were changed? I was under the impression that these metric >>>>>> names were >>>>>> > >> newly introduced due to the new interfaces and because it still >>>>>> depends on >>>>>> > >> each connector implementing these. >>>>>> > >> >>>>>> > >> Sidenote: metric names are not mentioned in the FLIP process as >>>>>> a public >>>>>> > >> API. Might make sense to have a separate follow-up to add that >>>>>> to the list >>>>>> > >> (I do think we should list them there). >>>>>> > >> >>>>>> > >> +1 for reverting this and make this change in Flink 1.16 >>>>>> > >> >>>>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: >>>>>> I think the >>>>>> > >> impact is too big for a patch version, especially given how long >>>>>> Flink 1.15 >>>>>> > >> is already out there. >>>>>> > >> >>>>>> > >> Best regards, >>>>>> > >> >>>>>> > >> Martijn >>>>>> > >> >>>>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xbjt...@gmail.com> >>>>>> wrote: >>>>>> > >> >>>>>> > >> > Thanks Qingsheng for starting this thread. >>>>>> > >> > >>>>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix >>>>>> this >>>>>> > >> > inconsistent behavior. >>>>>> > >> > >>>>>> > >> > >>>>>> > >> > Best, >>>>>> > >> > Leonard >>>>>> > >> > >>>>>> > >> > >>>>>> > >> > >>>>>> > >> > >>>>>> > >> > >>>>>> > >> > 2022年10月10日 下午3:06,Jark Wu <imj...@gmail.com> 写道: >>>>>> > >> > >>>>>> > >> > Thanks for discovering this problem, Qingsheng! >>>>>> > >> > >>>>>> > >> > I'm also +1 for reverting the breaking changes. >>>>>> > >> > >>>>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the >>>>>> new and old >>>>>> > >> > sink is inconsistent. >>>>>> > >> > We have to break one of them to have consistent behavior. Sink >>>>>> V2 is an >>>>>> > >> > evolving API which is just introduced in 1.15. >>>>>> > >> > I think it makes sense to break the unstable API instead of >>>>>> the stable API >>>>>> > >> > which many connectors and users depend on. >>>>>> > >> > >>>>>> > >> > Best, >>>>>> > >> > Jark >>>>>> > >> > >>>>>> > >> > >>>>>> > >> > >>>>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li < >>>>>> jingsongl...@gmail.com> wrote: >>>>>> > >> > >>>>>> > >> >> Thanks for driving, Qingsheng. >>>>>> > >> >> >>>>>> > >> >> +1 for reverting sink metric name. >>>>>> > >> >> >>>>>> > >> >> We often forget that metric is also one of the important APIs. >>>>>> > >> >> >>>>>> > >> >> +1 for releasing 1.15.3 to fix this. >>>>>> > >> >> >>>>>> > >> >> Best, >>>>>> > >> >> Jingsong >>>>>> > >> >> >>>>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin < >>>>>> becket....@gmail.com> wrote: >>>>>> > >> >> > >>>>>> > >> >> > Thanks for raising the discussion, Qingsheng, >>>>>> > >> >> > >>>>>> > >> >> > +1 on reverting the breaking changes. >>>>>> > >> >> > >>>>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this >>>>>> and update >>>>>> > >> >> the previous release docs with this known issue, so that >>>>>> users can upgrade >>>>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some >>>>>> backwards >>>>>> > >> >> compatibility tests on metrics to avoid unintended breaking >>>>>> changes like >>>>>> > >> >> this in the future. >>>>>> > >> >> > >>>>>> > >> >> > Thanks, >>>>>> > >> >> > >>>>>> > >> >> > Jiangjie (Becket) Qin >>>>>> > >> >> > >>>>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren < >>>>>> re...@apache.org> wrote: >>>>>> > >> >> >> >>>>>> > >> >> >> Hi devs and users, >>>>>> > >> >> >> >>>>>> > >> >> >> I’d like to start a discussion about reverting a breaking >>>>>> change about >>>>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 >>>>>> [2]. >>>>>> > >> >> >> >>>>>> > >> >> >> TL;DR >>>>>> > >> >> >> >>>>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 >>>>>> are replace >>>>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering >>>>>> metric names >>>>>> > >> >> are public APIs, this is a breaking change to end users and >>>>>> not backward >>>>>> > >> >> compatible. Also unfortunately this breaking change was not >>>>>> discussed in >>>>>> > >> >> the mailing list before. >>>>>> > >> >> >> >>>>>> > >> >> >> Background >>>>>> > >> >> >> >>>>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been >>>>>> changed so >>>>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut” >>>>>> is used for >>>>>> > >> >> reporting the total number of output records since the sink >>>>>> started (number >>>>>> > >> >> of records written to the external system), and similarly for >>>>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”, >>>>>> “numBytesOutPerSecond” and >>>>>> > >> >> “numRecordsOutError”. Most sinks are following this naming >>>>>> and definition. >>>>>> > >> >> However, these metrics are ambiguous in the new Sink API as >>>>>> “numXXXOut” >>>>>> > >> >> could be used by the output of SinkWriterOperator for >>>>>> reporting number of >>>>>> > >> >> Committables delivered to SinkCommitterOperator. In order to >>>>>> resolve the >>>>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these >>>>>> metrics with >>>>>> > >> >> “numXXXSend”. >>>>>> > >> >> >> >>>>>> > >> >> >> Necessity of reverting this change >>>>>> > >> >> >> >>>>>> > >> >> >> - Metric names are actually public API, as end users need >>>>>> to configure >>>>>> > >> >> metric collecting and alerting system with metric names. >>>>>> Users have to >>>>>> > >> >> reset all configurations related to affected metrics. >>>>>> > >> >> >> - This could also affect custom and external sinks not >>>>>> maintained by >>>>>> > >> >> Flink, which might have implemented with numXXXOut metrics. >>>>>> > >> >> >> - The number of records sent to external system is way >>>>>> more important >>>>>> > >> >> than the number of Committables sent to >>>>>> SinkCommitterOperator, as the >>>>>> > >> >> latter one is just an internal implementation of sink. We >>>>>> could have a new >>>>>> > >> >> metric name for the latter one instead. >>>>>> > >> >> >> - We could avoid splitting the project by version (like >>>>>> “plz use >>>>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert >>>>>> it ASAP, >>>>>> > >> >> cosidering 1.16 is still not released for now. >>>>>> > >> >> >> >>>>>> > >> >> >> As a consequence, I’d like to hear from devs and users >>>>>> about your >>>>>> > >> >> opinion on changing these metrics back to “numXXXOut”. >>>>>> > >> >> >> >>>>>> > >> >> >> Looking forward to your reply! >>>>>> > >> >> >> >>>>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126 >>>>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492 >>>>>> > >> >> >> [1] FLIP-33, version 18: >>>>>> > >> >> >>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136 >>>>>> > >> >> >> >>>>>> > >> >> >> Best, >>>>>> > >> >> >> Qingsheng >>>>>> > >> >> >>>>>> > >> > >>>>>> > >> > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> >>>>>