Hi Xintong,

Thank you for your suggestion.

After considering the idea of using a general configuration key, I think it may not be a good idea, for the reasons below.

While I agree that a more general configuration key gives us the flexibility to switch to other approaches for calculating the lag in the future, the downside is that it may confuse users. We currently have fetchEventTimeLag, emitEventTimeLag, and watermarkLag in the source, so a general key would not make clear which specific lag we are referring to.

With the potential introduction of the Generalized Watermark mechanism in the future, if I understand correctly, a watermark won't necessarily need to be a timestamp. I am concerned that a general configuration key may not be enough to cover all the use cases, and that we would need to introduce a general way to determine the backlog status regardless.

For the reasons above, I prefer introducing the configuration as is and changing it later with a deprecation or migration process. What do you think?

Best,
Xuannan

On Aug 14, 2023, 14:09 +0800, Xintong Song <tonysong...@gmail.com>, wrote:
> Thanks for the explanation.
>
> I wonder if it makes sense to not expose this detail via the configuration
> option. To be specific, I suggest not mentioning the "watermark" keyword in
> the configuration key and description.
>
> - From the users' perspective, I think they only need to know that when
>   there's a lag higher than the given threshold, Flink will consider the
>   latency of individual records as less important and prioritize throughput
>   over it. They don't really need the details of how the lags are
>   calculated.
> - For the internal implementation, I also think using watermark lags is
>   a good idea, for the reasons you've already mentioned. However, it's not
>   the only possible option. Hiding this detail from users would give us the
>   flexibility to switch to other approaches if needed in future.
> - We are currently working on designing the ProcessFunction API
>   (consider it as a DataStream API V2).
>   There's an idea to introduce a Generalized Watermark mechanism, where
>   basically the watermark can be anything that needs to travel along the
>   data-flow with certain alignment strategies, and event time watermark
>   would be one specific case of it. This is still an idea and has not been
>   discussed and agreed on by the community, and we are preparing a FLIP for
>   it. But if we are going for it, the concept "watermark-lag-threshold"
>   could be ambiguous.
>
> I do not intend to block the FLIP on this. I'd also be fine with
> introducing the configuration as is, and changing it later, if needed, with
> a regular deprecation and migration process. Just making my suggestions.
>
>
> Best,
>
> Xintong
>
>
>
> On Mon, Aug 14, 2023 at 12:00 PM Xuannan Su <suxuanna...@gmail.com> wrote:
>
> > Hi Xintong,
> >
> > Thanks for the reply.
> >
> > I considered using the timestamps in the records to determine the
> > backlog status, and in the end decided to use the watermark. By
> > definition, the watermark is the indication of time progress in the data
> > stream: it indicates that the stream’s event time has progressed to some
> > specific time. The timestamps in the records, on the other hand, are
> > usually used to generate the watermark. Therefore, it appears more
> > appropriate and intuitive to calculate the event time lag from the
> > watermark and use it to determine the backlog status. And by using the
> > watermark, we can easily deal with out-of-order data and idleness.
> >
> > Please let me know if you have further questions.
> >
> > Best,
> > Xuannan
> > On Aug 10, 2023, 20:23 +0800, Xintong Song <tonysong...@gmail.com>, wrote:
> > > Thanks for preparing the FLIP, Xuannan.
> > >
> > > +1 in general.
> > >
> > > A quick question: could you explain why we are relying on the watermark
> > > for emitting the record attribute? Why not use timestamps in the records?
> > > I don't see any concern in using watermarks. Just wondering if there are
> > > any deeper considerations behind this.
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Thu, Aug 3, 2023 at 3:03 PM Xuannan Su <suxuanna...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I am opening this thread to discuss FLIP-328: Allow source operators to
> > > > determine isProcessingBacklog based on watermark lag[1]. We had several
> > > > discussions with Dong Ling about the design; thanks for all the
> > > > valuable advice.
> > > >
> > > > The FLIP targets the use case where users want to run a Flink job to
> > > > backfill historical data in a high-throughput manner and then continue
> > > > processing real-time data with low latency. Building upon the backlog
> > > > concept introduced in FLIP-309[2], this proposal enables sources to
> > > > report their status of processing backlog based on the watermark lag.
> > > >
> > > > We would greatly appreciate any comments or feedback you may have on
> > > > this proposal.
> > > >
> > > > Best,
> > > > Xuannan
> > > >
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-328%3A+Allow+source+operators+to+determine+isProcessingBacklog+based+on+watermark+lag
> > > > [2]
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-309%3A+Support+using+larger+checkpointing+interval+when+source+is+processing+backlog
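
For readers skimming the thread, the rule under discussion can be sketched roughly as follows. This is only an illustration and not Flink code: the class `BacklogDetector`, the field `watermark_lag_threshold_ms`, and the method names are hypothetical, and it assumes the watermark lag is measured as the current wall-clock time minus the latest watermark, with the source reporting a backlog when that lag is strictly higher than the configured threshold.

```python
from dataclasses import dataclass


@dataclass
class BacklogDetector:
    """Hypothetical helper for illustration; not part of any Flink API."""

    # Assumed name for the configured threshold, in milliseconds.
    watermark_lag_threshold_ms: int

    def watermark_lag_ms(self, now_ms: int, watermark_ms: int) -> int:
        # Lag = how far the latest watermark trails the wall clock.
        return now_ms - watermark_ms

    def is_processing_backlog(self, now_ms: int, watermark_ms: int) -> bool:
        # Backlog is reported only when the lag is strictly above the threshold.
        lag = self.watermark_lag_ms(now_ms, watermark_ms)
        return lag > self.watermark_lag_threshold_ms


detector = BacklogDetector(watermark_lag_threshold_ms=5 * 60 * 1000)  # 5 minutes
now_ms = 1_700_000_000_000  # arbitrary wall-clock time in epoch milliseconds

# Watermark one hour behind the wall clock: treated as backlog.
print(detector.is_processing_backlog(now_ms, now_ms - 3_600_000))
# Watermark ten seconds behind the wall clock: treated as real-time data.
print(detector.is_processing_backlog(now_ms, now_ms - 10_000))
```

Because the decision depends only on a lag value and a threshold, the same check would work for any other lag definition, which is the flexibility argument made for a more general configuration key.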