Hi devs and users, It looks like we are getting an initial consensus in the discussion so I started a voting thread [1] just now. Looking forward to your feedback!
[1] https://lists.apache.org/thread/ozlf82mkm6ndx2n1vdgq532h156p4lt6 Best, Qingsheng On Thu, Oct 13, 2022 at 10:41 PM Jing Ge <j...@ververica.com> wrote: > Hi Qingsheng, > > Thanks for the clarification. +1, I like the idea. Pointing both numXXXOut > and numXXXSend to the same external data transfer metric does not really > break the new SinkV2 design, since there was no requirement to monitor the > internal traffic. So, I think both developer and user can live with it. It > might not be the perfect solution but is indeed the currently best > trade-off solution after considering the backward compatibility. I would > suggest firing a follow-up ticket after the PR to take care of the new > metric for the internal traffic in the future. > > Best regards, > Jing > > > On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote: > >> Hi Jing, >> >> Thanks for the reply! >> >> Let me rephrase my proposal: we’d like to use numXXXOut registered on >> SinkWriterOperator to reflect the traffic to the external system for >> compatibility with old versions before 1.15, and make numXXXSend have the >> same value as numXXXOut for compatibility within 1.15. That means both >> numXXXOut and numXXXSend are used for external data transfers, which end >> users care more about. As for the internal traffic within the sink, we >> could name a new metric for it because this is a _new_ feature in the _new_ >> sink, and end users usually don’t pay attention to internal implementation. >> The name of the new metric could be discussed later after 1.16 release. >> >> > but it might end up with monitoring unexpected metrics, which is even >> worse for users, i.e. I didn't change anything, but something has been >> broken since the last update. >> >> Yeah this is exactly what we are trying to fix with this proposal. I >> believe users are more concerned with the output to the external system >> than the internal data delivery in the sink, so I think we’ll have more >> cases reporting like “I set up a panel on numRecordsOut in sink to monitor >> the output of the job, but after upgrading to 1.15 this value is extremely >> low and I didn’t change anything” if we stick to the current situation. I >> think only a few end users care about the number of committables sending to >> downstream as most of them don’t care how the sink works. >> >> We do need a re-design to fully distinguish the internal and external >> traffic on metrics, not only in sink but in all operators as it’s quite >> common for operators to make IO. This needs time to design, discuss, adjust >> and vote, but considering this is blocking 1.16, maybe it’s better to >> rescue the compatibility for now, and leave the huge reconstruction to >> future versions (maybe 2.0). >> >> Best, >> Qingsheng >> >> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <j...@ververica.com> wrote: >> >>> Hi Qingsheng, >>> >>> Just want to make sure we are on the same page. Are you suggesting >>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all >>> the changes we did with FLINK-26126 and FLINK-26492? >>> >>> For the naming switch, please pay attention that the behaviour has been >>> changed since we introduced SinkV2[1]. So, please be aware of different >>> numbers(behaviour change) even with the same metrics name. Sticking with >>> the old name with the new behaviour (very bad idea, IMHO) might seem like >>> saving the effort in the first place, but it might end up with monitoring >>> unexpected metrics, which is even worse for users, i.e. I didn't change >>> anything, but something has been broken since the last update. >>> >>> For reverting, I am not sure how to fix the issue mentioned in >>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed >>> out, with SinkV2 we have two different output lines - one with the external >>> system and the other with the downstream operator. In this case, >>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The >>> "numXXXOut" metric can still be used, depending on what the user wants to >>> monitor. >>> >>> >>> Best regards, >>> Jing >>> >>> [1] >>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48 >>> >>> >>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org> wrote: >>> >>>> As a supplement, considering it could be a big reconstruction >>>> redefining internal and external traffic and touching metric names in >>>> almost all operators, this requires a lot of discussions and we might >>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker >>>> in front of us, as the output of sink is a metric that users care a >>>> lot about. >>>> >>>> Thanks, >>>> Qingsheng >>>> >>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote: >>>> > >>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer >>>> > metric definition distinguishing internal and external data transfers. >>>> > As you described, having IO in operators is quite common such as >>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP and >>>> > an overall design. >>>> > >>>> > However I think it's necessary to change the metric back to >>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and >>>> > 1.16. The most important argument is for compatibility as I mentioned >>>> > in my previous email, otherwise all users have to modify their configs >>>> > of metric systems after upgrading to Flink 1.15+, and all custom >>>> > connectors have to change their implementations to migrate to the new >>>> > metric name. I believe other ones participating and approving this >>>> > proposal share the same concern about compatibility too. Also >>>> > considering this issue is blocking the release of 1.16, maybe we could >>>> > fix this asap, and as for defining a new metric for internal data >>>> > transfers we can have an in-depth discussion later. WDYT? >>>> > >>>> > Best, >>>> > Qingsheng >>>> > >>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ches...@apache.org> >>>> wrote: >>>> > > >>>> > > Currently I think that would be a mistake. >>>> > > >>>> > > Ultimately what we have here is the culmination of us never really >>>> considering how the numRecordsOut metric should behave for operators that >>>> emit data to other operators _and_ external systems. This goes beyond >>>> sinks. >>>> > > This even applies to numRecordsIn, for cases where functions >>>> query/write data from/to the outside, (e.g., Async IO). >>>> > > >>>> > > Having 2 separate metrics for that, 1 exclusively for internal data >>>> transfers, and 1 exclusively for external data transfers, is the only way >>>> to get a consistent metric definition in the long-run. >>>> > > We can jump back-and-forth now or just commit to it. >>>> > > >>>> > > I don't think we can really judge this based on FLIP-33. It was >>>> IIRC written before the two phase sinks were added, which heavily blurred >>>> the lines of what a sink even is. Because it definitely is _not_ the last >>>> operator in a chain anymore. >>>> > > >>>> > > What I would suggest is to stick with what we got (although I >>>> despise the name numRecordsSend), and alias the numRecordsOut metric for >>>> all non-TwoPhaseCommittingSink. >>>> > > >>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote: >>>> > > >>>> > > Thanks for the details Chesnay! >>>> > > >>>> > > By “alias” I mean to respect the original definition made in >>>> FLIP-33 for numRecordsOut, which is the number of records written to the >>>> external system, and keep numRecordsSend as the same value as numRecordsOut >>>> for compatibility. >>>> > > >>>> > > I think keeping numRecordsOut for the output to the external system >>>> is more intuitive to end users because in most cases the metric of data >>>> flow output is more essential. I agree with you that a new metric is >>>> required, but considering compatibility and users’ intuition I prefer to >>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new >>>> metric for sink writer’s output to downstream operators. This might be >>>> against consistency with metrics in other operators in Flink but maybe it’s >>>> acceptable to have the sink as a special case. >>>> > > >>>> > > Best, >>>> > > Qingsheng >>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ches...@apache.org>, >>>> wrote: >>>> > > >>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of >>>> numXXXOut >>>> > > >>>> > > But that's not possible. If it were that simple there would have >>>> never been a need to introduce another metric in the first place. >>>> > > >>>> > > It's a rather fundamental issue with how the new sinks work, in >>>> that they emit data to the external system (usually considered as >>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream >>>> operator (usually considered as "numRecordsOut" of tasks). >>>> > > The original issue was that the numRecordsOut of the sink counted >>>> both (which is completely wrong). >>>> > > >>>> > > A new metric was always required; otherwise you inevitably end up >>>> breaking some semantic. >>>> > > Adding a new metric for what the sink writes to the external system >>>> is, for better or worse, more consistent with how these metrics usually >>>> work in Flink. >>>> > > >>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote: >>>> > > >>>> > > Thanks everyone for joining the discussion! >>>> > > >>>> > > > Do you have any idea what has happened in the process here? >>>> > > >>>> > > The discussion in this PR [1] shows some details and could be >>>> helpful to understand the original motivation of the renaming. We do have a >>>> test case for guarding metrics but unfortunaly the case was also modified >>>> so the defense was broken. >>>> > > >>>> > > I think the reason why both the developer and the reviewer forgot >>>> to trigger an discussion and gave a green pass on the change is that >>>> metrics are quite “trivial” to be noticed as public APIs. As mentioned by >>>> Martijn I couldn’t find a place noting that metrics are public APIs and >>>> should be treated carefully while contributing and reviewing. >>>> > > >>>> > > IMHO three actions could be made to prevent this kind of changes in >>>> the future: >>>> > > >>>> > > a. Add test case for metrics (which we already have in >>>> SinkMetricsITCase) >>>> > > b. We emphasize that any public-interface breaking changes should >>>> be proposed by a FLIP or discussed in mailing list, and should be listed in >>>> the release note. >>>> > > c. We remind contributors and reviewers about what should be >>>> considered as public API, and include metric names in it. >>>> > > >>>> > > For b and c these two pages [2][3] might be proper places. >>>> > > >>>> > > About the patch to revert this, it looks like we have a consensus >>>> on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I >>>> didn’t see complaints about this for now so it should be OK to save the >>>> situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of >>>> numXXXOut considering there could possibly some users have already adapted >>>> their system to the new naming, and have another internal metric for >>>> reflecting number of outgoing committable batches (actually the >>>> numRecordsIn of sink committer operator should be carrying this info >>>> already). >>>> > > >>>> > > [1] https://github.com/apache/flink/pull/18825 >>>> > > [2] https://flink.apache.org/contributing/contribute-code.html >>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html >>>> > > >>>> > > Best, >>>> > > Qingsheng >>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <tonysong...@gmail.com>, >>>> wrote: >>>> > > >>>> > > +1 for reverting these changes in Flink 1.16. >>>> > > >>>> > > For 1.15.3, can we make these metrics available via both names >>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who >>>> already migrated to 1.15 and numXXXSend. That means we still need to change >>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is >>>> internal to Flink sink. >>>> > > >>>> > > I'm overall +1 to change numXXXOut back to its original semantics. >>>> AFAIK (from meetup / flink-forward questionaires), most users do not >>>> migrate to a new Flink release immediately, until the next 1-2 major >>>> releases are out. >>>> > > >>>> > > Best, >>>> > > >>>> > > Xintong >>>> > > >>>> > > >>>> > > >>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser < >>>> martijnvis...@apache.org> wrote: >>>> > >> >>>> > >> Hi Qingsheng, >>>> > >> >>>> > >> Do you have any idea what has happened in the process here? Do we >>>> know why >>>> > >> they were changed? I was under the impression that these metric >>>> names were >>>> > >> newly introduced due to the new interfaces and because it still >>>> depends on >>>> > >> each connector implementing these. >>>> > >> >>>> > >> Sidenote: metric names are not mentioned in the FLIP process as a >>>> public >>>> > >> API. Might make sense to have a separate follow-up to add that to >>>> the list >>>> > >> (I do think we should list them there). >>>> > >> >>>> > >> +1 for reverting this and make this change in Flink 1.16 >>>> > >> >>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I >>>> think the >>>> > >> impact is too big for a patch version, especially given how long >>>> Flink 1.15 >>>> > >> is already out there. >>>> > >> >>>> > >> Best regards, >>>> > >> >>>> > >> Martijn >>>> > >> >>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xbjt...@gmail.com> >>>> wrote: >>>> > >> >>>> > >> > Thanks Qingsheng for starting this thread. >>>> > >> > >>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this >>>> > >> > inconsistent behavior. >>>> > >> > >>>> > >> > >>>> > >> > Best, >>>> > >> > Leonard >>>> > >> > >>>> > >> > >>>> > >> > >>>> > >> > >>>> > >> > >>>> > >> > 2022年10月10日 下午3:06,Jark Wu <imj...@gmail.com> 写道: >>>> > >> > >>>> > >> > Thanks for discovering this problem, Qingsheng! >>>> > >> > >>>> > >> > I'm also +1 for reverting the breaking changes. >>>> > >> > >>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new >>>> and old >>>> > >> > sink is inconsistent. >>>> > >> > We have to break one of them to have consistent behavior. Sink >>>> V2 is an >>>> > >> > evolving API which is just introduced in 1.15. >>>> > >> > I think it makes sense to break the unstable API instead of the >>>> stable API >>>> > >> > which many connectors and users depend on. >>>> > >> > >>>> > >> > Best, >>>> > >> > Jark >>>> > >> > >>>> > >> > >>>> > >> > >>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li < >>>> jingsongl...@gmail.com> wrote: >>>> > >> > >>>> > >> >> Thanks for driving, Qingsheng. >>>> > >> >> >>>> > >> >> +1 for reverting sink metric name. >>>> > >> >> >>>> > >> >> We often forget that metric is also one of the important APIs. >>>> > >> >> >>>> > >> >> +1 for releasing 1.15.3 to fix this. >>>> > >> >> >>>> > >> >> Best, >>>> > >> >> Jingsong >>>> > >> >> >>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin < >>>> becket....@gmail.com> wrote: >>>> > >> >> > >>>> > >> >> > Thanks for raising the discussion, Qingsheng, >>>> > >> >> > >>>> > >> >> > +1 on reverting the breaking changes. >>>> > >> >> > >>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this >>>> and update >>>> > >> >> the previous release docs with this known issue, so that users >>>> can upgrade >>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some >>>> backwards >>>> > >> >> compatibility tests on metrics to avoid unintended breaking >>>> changes like >>>> > >> >> this in the future. >>>> > >> >> > >>>> > >> >> > Thanks, >>>> > >> >> > >>>> > >> >> > Jiangjie (Becket) Qin >>>> > >> >> > >>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren < >>>> re...@apache.org> wrote: >>>> > >> >> >> >>>> > >> >> >> Hi devs and users, >>>> > >> >> >> >>>> > >> >> >> I’d like to start a discussion about reverting a breaking >>>> change about >>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 >>>> [2]. >>>> > >> >> >> >>>> > >> >> >> TL;DR >>>> > >> >> >> >>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 >>>> are replace >>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering >>>> metric names >>>> > >> >> are public APIs, this is a breaking change to end users and not >>>> backward >>>> > >> >> compatible. Also unfortunately this breaking change was not >>>> discussed in >>>> > >> >> the mailing list before. >>>> > >> >> >> >>>> > >> >> >> Background >>>> > >> >> >> >>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been >>>> changed so >>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut” >>>> is used for >>>> > >> >> reporting the total number of output records since the sink >>>> started (number >>>> > >> >> of records written to the external system), and similarly for >>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” >>>> and >>>> > >> >> “numRecordsOutError”. Most sinks are following this naming and >>>> definition. >>>> > >> >> However, these metrics are ambiguous in the new Sink API as >>>> “numXXXOut” >>>> > >> >> could be used by the output of SinkWriterOperator for reporting >>>> number of >>>> > >> >> Committables delivered to SinkCommitterOperator. In order to >>>> resolve the >>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these >>>> metrics with >>>> > >> >> “numXXXSend”. >>>> > >> >> >> >>>> > >> >> >> Necessity of reverting this change >>>> > >> >> >> >>>> > >> >> >> - Metric names are actually public API, as end users need to >>>> configure >>>> > >> >> metric collecting and alerting system with metric names. Users >>>> have to >>>> > >> >> reset all configurations related to affected metrics. >>>> > >> >> >> - This could also affect custom and external sinks not >>>> maintained by >>>> > >> >> Flink, which might have implemented with numXXXOut metrics. >>>> > >> >> >> - The number of records sent to external system is way more >>>> important >>>> > >> >> than the number of Committables sent to SinkCommitterOperator, >>>> as the >>>> > >> >> latter one is just an internal implementation of sink. We could >>>> have a new >>>> > >> >> metric name for the latter one instead. >>>> > >> >> >> - We could avoid splitting the project by version (like “plz >>>> use >>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert >>>> it ASAP, >>>> > >> >> cosidering 1.16 is still not released for now. >>>> > >> >> >> >>>> > >> >> >> As a consequence, I’d like to hear from devs and users about >>>> your >>>> > >> >> opinion on changing these metrics back to “numXXXOut”. >>>> > >> >> >> >>>> > >> >> >> Looking forward to your reply! >>>> > >> >> >> >>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126 >>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492 >>>> > >> >> >> [1] FLIP-33, version 18: >>>> > >> >> >>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136 >>>> > >> >> >> >>>> > >> >> >> Best, >>>> > >> >> >> Qingsheng >>>> > >> >> >>>> > >> > >>>> > >> > >>>> > > >>>> > > >>>> > > >>>> >>>