+1 for interval-during-backlog
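For reference, with the option names as currently proposed in FLIP-309 this boils down to two user-facing settings, roughly as follows (the second name may still change, e.g. to execution.checkpointing.max-interval as suggested in the quoted mail below):

    execution.checkpointing.interval: 30 s
    execution.checkpointing.interval-during-backlog: 10 min

i.e. checkpoint frequently while processing fresh data, and fall back to the longer interval as long as any source still reports isProcessingBacklog = true.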
best,
leonard

> On Jul 14, 2023, at 11:38 PM, Piotr Nowojski <piotr.nowoj...@gmail.com> wrote:
>
> Hi All,
>
> We had a lot of off-line discussions. As a result, I would suggest dropping the idea of introducing an end-to-end-latency concept until we can properly implement it, which will require more designing and experimenting. I would suggest starting with a more manual solution, where the user needs to configure concrete parameters, like `execution.checkpointing.max-interval` or `execution.flush-interval`.
>
> FLIP-309 looks good to me, I would just rename `execution.checkpointing.interval-during-backlog` to `execution.checkpointing.max-interval`.
>
> I would also reference future work: a solution that allows setting `isProcessingBacklog` for sources like Kafka will be introduced via FLIP-328 [1].
>
> Best,
> Piotrek
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-328%3A+Allow+source+operators+to+determine+isProcessingBacklog+based+on+watermark+lag
>
> On Wed, Jul 12, 2023 at 03:49 Dong Lin <lindon...@gmail.com> wrote:
>
>> Hi Piotr,
>>
>> I think I understand your motivation for suggesting execution.slow-end-to-end-latency now. Please see my follow-up comments (after the previous email) inline.
>>
>> On Wed, Jul 12, 2023 at 12:32 AM Piotr Nowojski <pnowoj...@apache.org> wrote:
>>
>>> Hi Dong,
>>>
>>> Thanks for the updates, a couple of comments:
>>>
>>>> If a record is generated by a source when the source's isProcessingBacklog is true, or some of the records used to derive this record (by an operator) have isBacklog = true, then this record should have isBacklog = true. Otherwise, this record should have isBacklog = false.
>>>
>>> nit: I think this conflicts with the "Rule of thumb for non-source operators to set isBacklog = true for the records it emits:" section later on, when it comes to the case where an operator has mixed isBacklog = false and isBacklog = true inputs.
>>>
>>>> execution.checkpointing.interval-during-backlog
>>>
>>> Do we need to define this as an interval config parameter? Won't that add an option that will be almost instantly deprecated, because what we actually would like to have is execution.slow-end-to-end-latency and execution.end-to-end-latency?
>>>
>> I guess you are suggesting that we should allow users to specify a higher end-to-end latency budget for those records that are emitted by a two-phase commit sink than for those records that are emitted by a non-two-phase commit sink.
>>
>> My concern with this approach is that it will increase the complexity of the definition of the "processing latency requirement", as well as the complexity of the Flink runtime code that handles it. Currently, FLIP-325 defines end-to-end latency as an attribute of the records that is statically assigned when the record is generated at the source, regardless of how it will be emitted later in the topology. If we make the changes proposed above, we would need to define the latency requirement w.r.t. the attributes of the operators that a record travels through before its result is emitted, which is less intuitive and more complex.
>>
>> For now, it is not clear whether it is necessary to have two categories of latency requirement for the same job.
>> Maybe it is reasonable to assume that if a job has a two-phase commit sink and the user is OK with emitting some results at a 1-minute interval, then more likely than not the user is also OK with emitting all results at a 1-minute interval, including those that go through a non-two-phase commit sink?
>>
>> If we do want to support different end-to-end latencies depending on whether the result is emitted by a two-phase commit sink, I would prefer to still use execution.checkpointing.interval-during-backlog instead of execution.slow-end-to-end-latency. This allows us to keep the concept of end-to-end latency simple. Also, by explicitly including "checkpointing interval" in the name of the config that directly affects the checkpointing interval, we can make it easier and more intuitive for users to understand the impact and set a proper value for such configs.
>>
>> What do you think?
>>
>> Best,
>> Dong
>>
>>> Maybe we can introduce only `execution.slow-end-to-end-latency` (or maybe a better name), and for the time being use it as the checkpoint interval value during backlog?
>>>
>>> Or do you envision that in the future users will be configuring only:
>>> - execution.end-to-end-latency
>>> and only optionally:
>>> - execution.checkpointing.interval-during-backlog
>>> ?
>>>
>>> Best,
>>> Piotrek
>>>
>>> PS: I will read the summary that you have just published later, but I think we don't need to block this FLIP on the existence of that high-level summary.
>>>
>>> On Tue, Jul 11, 2023 at 17:49 Dong Lin <lindon...@gmail.com> wrote:
>>>
>>>> Hi Piotr and everyone,
>>>>
>>>> I have documented the vision with a summary of the existing work in this doc. Please feel free to review/comment/edit this doc. Looking forward to working with you together on this line of work.
>>>>
>>>> https://docs.google.com/document/d/1CgxXvPdAbv60R9yrrQAwaRgK3aMAgAL7RPPr799tOsQ/edit?usp=sharing
>>>>
>>>> Best,
>>>> Dong
>>>>
>>>> On Tue, Jul 11, 2023 at 1:07 AM Piotr Nowojski <piotr.nowoj...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Dong and I chatted offline about the above-mentioned issues (thanks for that offline chat, I think it helped both of us a lot). The summary is below.
>>>>>
>>>>>> Previously, I thought you meant to add generic logic in SourceReaderBase to read existing metrics (e.g. backpressure) and emit the IsProcessingBacklogEvent to the SourceCoordinator. I am sorry if I have misunderstood your suggestions.
>>>>>>
>>>>>> After double-checking your previous suggestion, I am wondering if you are OK with the following approach:
>>>>>>
>>>>>> - Add a job-level config execution.checkpointing.interval-during-backlog.
>>>>>> - Add an API SourceReaderContext#setProcessingBacklog(boolean isProcessingBacklog).
>>>>>> - When this API is invoked, it internally sends an internal SourceReaderBacklogEvent to the SourceCoordinator.
>>>>>> - The SourceCoordinator should keep track of the latest isProcessingBacklog status from all its subtasks. And for now, we will hardcode the logic such that if any source reader says it is under backlog, then execution.checkpointing.interval-during-backlog is used.
>>>>>>
>>>>>> This approach looks good to me as it can achieve the same performance with the same number of public APIs for the target use-case.
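For concreteness, the aggregation described in the quoted proposal above could be as small as the following sketch. SourceReaderBacklogEvent, setProcessingBacklog and the BacklogTracker class are names proposed in this thread or made up here for illustration, not an existing Flink API:

    import java.time.Duration;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: "any subtask in backlog" => use the backlog checkpoint interval.
    class BacklogTracker {
        private final Set<Integer> subtasksInBacklog = new HashSet<>();

        // Would be called by the SourceCoordinator when it receives the
        // (hypothetical) SourceReaderBacklogEvent from a source subtask.
        void onBacklogEvent(int subtaskId, boolean isProcessingBacklog) {
            if (isProcessingBacklog) {
                subtasksInBacklog.add(subtaskId);
            } else {
                subtasksInBacklog.remove(subtaskId);
            }
        }

        // Chooses between execution.checkpointing.interval and
        // execution.checkpointing.interval-during-backlog.
        Duration currentCheckpointInterval(Duration regular, Duration duringBacklog) {
            return subtasksInBacklog.isEmpty() ? regular : duringBacklog;
        }
    }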
>>>>>> And I suppose in the future we might be able to re-use this API for a source reader to set its backlog status based on its backpressure metrics, which could be an extra advantage over the current approach.
>>>>>>
>>>>>> Do you think we can agree to adopt the approach described above?
>>>>>
>>>>> Yes, I think that's a viable approach. I would be perfectly fine with not introducing `SourceReaderContext#setProcessingBacklog(boolean isProcessingBacklog)` and sending the `SourceReaderBacklogEvent` from the SourceReader to the JM in this FLIP. It could be implemented once we decide to add some more generic ways of detecting backlog/backpressure on the SourceReader level.
>>>>>
>>>>> I think we could also just keep the current proposal of adding `SplitEnumeratorContext#setIsProcessingBacklog`, and use it in the sources that can set it on the `SplitEnumerator` level. Later we could merge this with other mechanisms of detecting "isProcessingBacklog", like ones based on watermark lag, backpressure, etc., via some component running on the JM.
>>>>>
>>>>> At the same time, I'm fine with having the "isProcessingBacklog" concept to switch the runtime back and forth between high and low latency modes instead of "backpressure". In FLIP-325 I have asked:
>>>>>
>>>>>> I think there is one thing that hasn't been discussed, either here or in FLIP-309. Given that we have three dimensions:
>>>>>> - e2e latency/checkpointing interval
>>>>>> - enabling some kind of batching/buffering on the operator level
>>>>>> - how many resources we want to allocate to the job
>>>>>>
>>>>>> How do we want Flink to adjust itself between those three? For example:
>>>>>> a) Should we assume that a given job has a fixed amount of assigned resources and make it paramount that Flink doesn't exceed those available resources? So in case of backpressure, we should extend checkpointing intervals, and emit records less frequently and in batches.
>>>>>> b) Or should we assume that the amount of resources is flexible (up to a point?), and the desired e2e latency is the paramount aspect? So in case of backpressure, we should still adhere to the configured e2e latency, and wait for the user or autoscaler to scale up the job?
>>>>>>
>>>>>> In case of a), I think the concept of "isProcessingBacklog" is not needed, we could steer the behaviour using only the backpressure information.
>>>>>>
>>>>>> On the other hand, in case of b), "isProcessingBacklog" information might be helpful, to let Flink know that we can safely decrease the e2e latency/checkpoint interval even if there is no backpressure, to use fewer resources (and let the autoscaler scale down the job).
>>>>>>
>>>>>> Do we want to have both, or only one of those? Do a) and b) complement one another? If the job is backpressured, we should follow a) and expose to the autoscaler/users the information "Hey! I'm barely keeping up! I need more resources!".
>>>>>> While, when there is no backpressure and latency doesn't matter (isProcessingBacklog=true), we can limit the resource usage.
>>>>>
>>>>> After thinking this over:
>>>>> - the case where we don't have the "isProcessingBacklog" information but the source operator is backpressured must be intermittent. Either the backpressure will go away, or shortly we should reach the "isProcessingBacklog" state anyway
>>>>> - and even if we implement some backpressure-detecting algorithm to switch the runtime into the "high latency mode", we can always report that as "isProcessingBacklog" anyway, as the runtime should react the same way in both cases (the backpressure and "isProcessingBacklog" states).
>>>>>
>>>>> ===============
>>>>>
>>>>> With a common understanding of the final solution that we want to have in the future, I'm pretty much fine with the current FLIP-309 proposal, with a couple of remarks:
>>>>> 1. Could you include in FLIP-309 the long-term solution as we have discussed?
>>>>>    a) It would be nice to have some diagram showing how the "isProcessingBacklog" information would be travelling, being aggregated, and what will be done with that information (from SourceReader/SplitEnumerator to some "component" aggregating it, and then ... ?).
>>>>> 2. For me, "processing backlog" doesn't necessarily equate to "backpressure" (HybridSource can be both NOT backpressured and processing backlog at the same time). If you think the same way, can you include that definition of "processing backlog" in the FLIP, including its relation to the backpressure state? If not, we need to align on that definition first :)
>>>>>
>>>>> Also, I'm missing a big-picture description that would show what you are trying to achieve and what the overarching vision is behind all of the current and future FLIPs that you are planning in this area (FLIP-309, FLIP-325, FLIP-327, FLIP-331, ...?). Or was it described somewhere and I've missed it?
>>>>>
>>>>> Best,
>>>>> Piotrek
>>>>>
>>>>> On Thu, Jul 6, 2023 at 06:25 Dong Lin <lindon...@gmail.com> wrote:
>>>>>
>>>>>> Hi Piotr,
>>>>>>
>>>>>> I am sorry if you feel unhappy or upset with us for not following/fixing your proposal. It is not my intention to give you this feeling. After all, we are all trying to make Flink better, to support more use-cases with the most maintainable code. I hope you understand that, just like you, I have also been doing my best to think through various design options and taking time to evaluate the pros/cons. Eventually, we probably still need to reach consensus by clearly listing and comparing the objective pros/cons of different proposals and identifying the best choice.
>>>>>>
>>>>>> Regarding your concern (or frustration) that we are always finding issues in your proposal, I would say it is normal (and probably necessary) for developers to find pros/cons in each other's solutions, so that we can eventually pick the right one. I will appreciate anyone who can correctly pinpoint a concrete issue in my proposal so that I can improve it or choose an alternative solution.
>>>>>>
>>>>>> Regarding your concern that we are not spending enough effort to find solutions and that the problem in your solution can be solved in a minute, I would like to say that is not true. For each of your previous proposals, I typically spent 1+ hours thinking through your proposal to understand whether it works and why it does not work, and another 1+ hours to write down the details and explain why it does not work. And I have had a variety of offline discussions with my colleagues discussing various proposals (including yours), 6+ hours in total. Maybe I am not capable enough to fix those issues in one minute or so. If you think your proposal can be easily fixed in one minute or so, I would really appreciate it if you could think through your proposal and fix it in the first place :)
>>>>>>
>>>>>> For your information, I have had several long discussions with my colleagues at Alibaba and also with Becket on this FLIP. We have seriously considered your proposals and discussed in detail what the pros/cons are and whether we can improve these solutions. The initial version of this FLIP (which allows the source operator to specify checkpoint intervals) did not get enough support due to concerns about not being generic (i.e. users need to specify checkpoint intervals on a per-source basis). It was only after I updated the FLIP to use the job-level execution.checkpointing.interval-during-backlog that they agreed to give +1 to the FLIP. What I want to tell you is that your suggestions have been taken seriously, and the quality of the FLIP has been taken seriously by all those who have voted. As a result of taking your suggestion seriously and trying to find improvements, we updated the FLIP to use isProcessingBacklog.
>>>>>>
>>>>>> I am wondering, do you think it would be useful to discuss face-to-face via a video conference call? It is not just between you and me. We can invite the developers who are interested to join and help with the discussion. That might improve communication efficiency and help us understand each other better :)
>>>>>>
>>>>>> I am writing this long email to hopefully get your understanding. I care much more about the quality of the eventual solution than about who proposed the solution. Please bear with me and see my comments inline, with an explanation of the pros/cons of these proposals.
>>>>>>
>>>>>> On Wed, Jul 5, 2023 at 11:06 PM Piotr Nowojski <piotr.nowoj...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Guys,
>>>>>>>
>>>>>>> I would like to ask you again to spend a bit more effort on trying to find solutions, not just pointing out problems. For 1.5 months the discussion hasn't been going in circles, but rather: I'm suggesting a solution, you are trying to undermine it with some arguments, I'm coming back with a fix, often an extremely easy one, only for you to try to find yet another "issue". It doesn't bode well if you are finding a "problem" that can be solved with a minute or so of thinking, or that has even already been solved.
>>>>>>>
>>>>>>> I have provided you so far with at least three distinct solutions that could address your exact target use-case. Two [1][2] are generic enough to probably be good enough for the foreseeable future; one is intermediate and not generic [3], but wouldn't require @Public API changes or some custom hidden interfaces.
>>>>>>>
>>>>>>> All in all:
>>>>>>> - [1] with added metric hints like "isProcessingBacklog" solves your target use case pretty well. The downside is having to improve how the JM is collecting/aggregating metrics.
>>>>>>
>>>>>> Here is my analysis of this proposal compared to the current approach in FLIP-309.
>>>>>>
>>>>>> pros:
>>>>>> - No need to add the public API SplitEnumeratorContext#setIsProcessingBacklog.
>>>>>> cons:
>>>>>> - Need to add a public API that subclasses of SourceReader can use to specify their IsProcessingBacklog metric value.
>>>>>> - The SourceCoordinator needs to periodically pull the isProcessingBacklog metrics from all TMs throughout the job execution.
>>>>>>
>>>>>> Here is why I think the cons outweigh the pros:
>>>>>> 1) The JM needs to collect/aggregate metrics with extra runtime overhead, which is not necessary for the target use-case with the push-based approach in FLIP-309.
>>>>>> 2) For the target use-case, it is simpler and more intuitive for source operators (e.g. HybridSource, MySQL CDC source) to be able to set their isProcessingBacklog status in the SplitEnumerator. This is because the switch between bounded/unbounded stages happens in their SplitEnumerator.
>>>>>>
>>>>>>> - [2] is basically an equivalent of [1], replacing metrics with events. It also is a superset of your proposal.
>>>>>>
>>>>>> Previously, I thought you meant to add generic logic in SourceReaderBase to read existing metrics (e.g. backpressure) and emit the IsProcessingBacklogEvent to the SourceCoordinator. I am sorry if I have misunderstood your suggestions.
>>>>>>
>>>>>> After double-checking your previous suggestion, I am wondering if you are OK with the following approach:
>>>>>>
>>>>>> - Add a job-level config execution.checkpointing.interval-during-backlog.
>>>>>> - Add an API SourceReaderContext#setProcessingBacklog(boolean isProcessingBacklog).
>>>>>> - When this API is invoked, it internally sends an internal SourceReaderBacklogEvent to the SourceCoordinator.
>>>>>> - The SourceCoordinator should keep track of the latest isProcessingBacklog status from all its subtasks. And for now, we will hardcode the logic such that if any source reader says it is under backlog, then execution.checkpointing.interval-during-backlog is used.
>>>>>>
>>>>>> This approach looks good to me as it can achieve the same performance with the same number of public APIs for the target use-case. And I suppose in the future we might be able to re-use this API for a source reader to set its backlog status based on its backpressure metrics, which could be an extra advantage over the current approach.
>>>>>>
>>>>>> Do you think we can agree to adopt the approach described above?
>>>>>>
>>>>>>> - [3] yes, it's hacky, but it's a solution that could be thrown away once we implement [1] or [2].
>>>>>>> The only real theoretical downside is that it cannot control the long checkpoint interval exactly (the short checkpoint interval has to be a divisor of the long checkpoint interval), but I simply cannot imagine a practical use where that would be a blocker for a user. Please..., someone wanting to set the short checkpoint interval to 3 minutes and the long one to 7 minutes, and that someone cannot accept the long interval being 9 minutes? And that's even ignoring the fact that if someone has an issue with the 3-minute checkpoint interval, I can hardly think that merely doubling the interval to 7 minutes would significantly solve any problem for that user.
>>>>>>
>>>>>> Yes, this is a fabricated example that shows execution.checkpointing.interval-during-backlog might not be accurately enforced with this option. I think you are probably right that it might not matter that much. I just think we should try our best to keep the semantics of Flink's public APIs (including configuration) clear, simple, and enforceable. If we can make the user-facing configuration enforceable at the cost of an extra developer-facing API (i.e. setProcessingBacklog(...)), I would prefer to do this.
>>>>>>
>>>>>> It seems that we both agree that option [2] is better than [3]. I will skip further comments on this option and we can probably focus on option [2] :)
>>>>>>
>>>>>>> Dong, a long time ago you wrote:
>>>>>>>> Sure. Then let's decide the final solution first.
>>>>>>>
>>>>>>> Have you thought about that? Maybe I'm wrong, but I don't remember you describing in any of your proposals how they could be extended in the future to cover more generic cases. Regardless of whether you either don't believe in the generic solution or struggle to
>>>>>>
>>>>>> Yes, I have thought about the plan to extend the current FLIP to support the metrics-based (e.g. backpressure) solution you described earlier. Actually, I mentioned multiple times in the earlier emails that your suggestion of using metrics is valuable and I will do this in a follow-up FLIP.
>>>>>>
>>>>>> Here are my comments from the previous emails:
>>>>>> - See "I will add follow-up FLIPs to make use of the event-time metrics and backpressure metrics" from Jul 3, 2023, 6:39 PM
>>>>>> - See "I agree it is valuable" from Jul 1, 2023, 11:00 PM
>>>>>> - See "we will create a followup FLIP (probably in FLIP-328)" from Jun 29, 2023, 11:01 AM
>>>>>>
>>>>>> Frankly speaking, I think the idea around using the backpressure metrics still needs a bit more thinking before we can propose a FLIP. But I am pretty sure we can make use of the watermark/event-time to determine the backlog status.
>>>>>>
>>>>>>> grasp it, if you can come back with something that can be easily extended in the future, up to a point where one could implement something similar to the backpressure-detecting algorithm that I mentioned many times before, I would be happy to discuss and support it.
>>>>>>
>>>>>> Here is my idea of extending the source reader to support event-time-based backlog-detecting algorithms:
>>>>>>
>>>>>> - Add a job-level config such as watermark-lag-threshold-for-backlog. If any source reader determines that the event timestamp is available and (system time - watermark) exceeds this threshold, then the source reader considers its isProcessingBacklog=true.
>>>>>> - The source reader can send an event to the source coordinator. Note that this might be doable in SourceReaderBase without adding any public API which the concrete SourceReader subclass needs to explicitly invoke.
>>>>>> - And in the future, if FLIP-325 is accepted, instead of sending the event to the SourceCoordinator and letting the SourceCoordinator inform the checkpoint coordinator, the source reader might just emit the information as part of the RecordAttributes and let the two-phase commit sink inform the checkpoint coordinator.
>>>>>>
>>>>>> Note that this is a sketch of the idea and it might need further improvement. I just hope you understand that we have thought about this idea and did quite a lot of thinking about these design options. If it is OK with you, I hope we can make incremental progress and discuss the metrics-based solution separately in a follow-up FLIP.
>>>>>>
>>>>>> Last but not least, thanks for taking so much time to leave comments and help us improve the FLIP. Please kindly bear with us in this discussion. I am looking forward to collaborating with you to find the best design for the target use-cases.
>>>>>>
>>>>>> Best,
>>>>>> Dong
>>>>>>
>>>>>>> Hang, about your points 1. and 2., do you think those problems are insurmountable and blockers for that counter-proposal?
>>>>>>>
>>>>>>>> 1. It is hard to find the error checkpoint.
>>>>>>>
>>>>>>> No it's not, please take a look at what I exactly proposed and maybe at the code.
>>>>>>>
>>>>>>>> 2. (...) The failed checkpoint may make them think the job is unhealthy.
>>>>>>>
>>>>>>> Please read again what I wrote in [3]. I'm mentioning there a solution for this exact "problem".
>>>>>>>
>>>>>>> About the necessity of the config value, I'm still not convinced that it's needed from the start, but yes, we can add some config option if you think otherwise. This option, if named properly, could be re-used in the future for different solutions, so that's fine by me.
>>>>>>>
>>>>>>> Best,
>>>>>>> Piotrek
>>>>>>>
>>>>>>> [1] Introduced in my very first e-mail from May 23, 2023, 16:26, and refined later with point "2." in my e-mail from June 16, 2023, 17:58
>>>>>>> [2] Section "2. ===============" in my e-mail from June 30, 2023, 16:34
>>>>>>> [3] Section "3. ===============" in my e-mail from June 30, 2023, 16:34
>>>>>>>
>>>>>>> All times in CEST.
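For illustration, the watermark-lag rule from Dong's Jul 6 reply above (a source reader considers itself to be processing backlog once system time minus its current watermark exceeds a configured threshold) amounts to roughly the following sketch; the threshold option name is the one floated in the thread and the method itself is hypothetical:

    import java.time.Duration;

    // Sketch of the watermark-lag heuristic: the reader treats itself as
    // processing backlog once its watermark lags behind system time by more
    // than a (hypothetical) job-level option such as
    // "watermark-lag-threshold-for-backlog".
    static boolean isProcessingBacklog(long currentWatermarkMillis, Duration lagThreshold) {
        long lagMillis = System.currentTimeMillis() - currentWatermarkMillis;
        return lagMillis > lagThreshold.toMillis();
    }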