Thanks Jark, Jingsong, and Zhu for the review! Thanks Zhu for the suggestion. I have updated the configuration name as suggested.
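
For reference, after the rename the two options would sit side by side in the job configuration, along these lines (the values below are only illustrative):

    execution.checkpointing.interval: 30s
    execution.checkpointing.interval-during-backlog: 30min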

On Tue, Jun 27, 2023 at 4:45 PM Zhu Zhu <reed...@gmail.com> wrote:
> Thanks Dong and Yunfeng for creating this FLIP and driving this discussion.
>
> The new design looks generally good to me. Increasing the checkpoint interval when the job is processing backlogs is easier for users to understand and can help in more scenarios.
>
> I have one comment about the new configuration. Naming the new configuration "execution.checkpointing.interval-during-backlog" would be better according to the Flink config naming convention. It is also because nested config keys should be avoided. See FLINK-29372 for more details.
>
> Thanks,
> Zhu
>
> On Tue, Jun 27, 2023 at 3:45 PM Jingsong Li <jingsongl...@gmail.com> wrote:
> >
> > Looks good to me!
> >
> > Thanks Dong, Yunfeng and all for your discussion and design.
> >
> > Best,
> > Jingsong
> >
> > On Tue, Jun 27, 2023 at 3:35 PM Jark Wu <imj...@gmail.com> wrote:
> > >
> > > Thank you Dong for driving this FLIP.
> > >
> > > The new design looks good to me!
> > >
> > > Best,
> > > Jark
> > >
> > > > On Jun 27, 2023, at 2:38 PM, Dong Lin <lindon...@gmail.com> wrote:
> > > >
> > > > Thank you Leonard for the review!
> > > >
> > > > Hi Piotr, do you have any comments on the latest proposal?
> > > >
> > > > I am wondering if it is OK to start the voting thread this week.
> > > >
> > > > On Mon, Jun 26, 2023 at 4:10 PM Leonard Xu <xbjt...@gmail.com> wrote:
> > > >
> > > >> Thanks Dong for driving this FLIP forward!
> > > >>
> > > >> Introducing the `backlog status` concept for Flink jobs makes sense to me for the following reasons:
> > > >>
> > > >> From the concept/API design perspective, it is more general and natural than the above proposals, as it can be used in HybridSource for bounded records, in the CDC source for the history snapshot, and in general sources like KafkaSource for historical messages.
> > > >>
> > > >> From the user cases/requirements perspective, I have seen many users manually set a larger checkpoint interval during backfilling and then set a shorter checkpoint interval for real-time processing in their production environments as a Flink application optimization. Now the Flink framework can make this optimization no longer require the user to set the checkpoint interval and restart the job multiple times.
> > > >>
> > > >> Following the support for using a larger checkpoint interval for jobs under backlog status in the current FLIP, we can explore supporting larger parallelism/memory/cpu for jobs under backlog status in the future.
> > > >>
> > > >> In short, the updated FLIP looks good to me.
> > > >>
> > > >> Best,
> > > >> Leonard
> > > >>
> > > >>> On Jun 22, 2023, at 12:07 PM, Dong Lin <lindon...@gmail.com> wrote:
> > > >>>
> > > >>> Hi Piotr,
> > > >>>
> > > >>> Thanks again for proposing the isProcessingBacklog concept.
> > > >>>
> > > >>> After discussing with Becket Qin and thinking about this more, I agree it is a better idea to add a top-level concept to all source operators to address the target use-case.
> > > >>>
> > > >>> The main reason that changed my mind is that isProcessingBacklog can be described as an inherent/natural attribute of every source instance, and its semantics do not need to depend on any specific checkpointing policy. Also, we can hardcode the isProcessingBacklog behavior for the sources we have considered so far (e.g. HybridSource and MySQL CDC source) without asking users to explicitly configure the per-source behavior, which indeed provides a better user experience.
> > > >>>
> > > >>> I have updated the FLIP based on the latest suggestions. The latest FLIP no longer introduces a per-source config that can be used by end-users. While I agree with you that CheckpointTrigger can be a useful feature to address additional use-cases, I am not sure it is necessary for the use-case targeted by FLIP-309. Maybe we can introduce CheckpointTrigger separately in another FLIP?
> > > >>>
> > > >>> Can you help take another look at the updated FLIP?
> > > >>>
> > > >>> Best,
> > > >>> Dong
> > > >>>
> > > >>> On Fri, Jun 16, 2023 at 11:59 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
> > > >>>
> > > >>>> Hi Dong,
> > > >>>>
> > > >>>>> Suppose there are 1000 subtasks and each subtask has a 1% chance of being "backpressured" at a given time (due to random traffic spikes). Then at any given time, the chance of the job being considered not-backpressured = (1-0.01)^1000. Since we evaluate the backpressure metric once a second, the estimated time for the job to be considered not-backpressured is roughly 1 / ((1-0.01)^1000) = 23163 sec = 6.4 hours.
> > > >>>>>
> > > >>>>> This means that the job will effectively always use the longer checkpointing interval. It looks like a real concern, right?
> > > >>>>
> > > >>>> Sorry, I don't understand where you are getting those numbers from. Instead of trying to find loophole after loophole, could you try to think how a given loophole could be improved/solved?
> > > >>>>
> > > >>>>> Hmm... I honestly think it will be useful to know the APIs due to the following reasons.
> > > >>>>
> > > >>>> Please propose something. I don't think it's needed.
> > > >>>>
> > > >>>>> - For the use-case mentioned in the FLIP-309 motivation section, would the APIs of this alternative approach be more or less usable?
> > > >>>>
> > > >>>> Everything that you originally wanted to achieve in FLIP-309, you could do as well in my proposal. Vide my many mentions of the "hacky solution".
> > > >>>>
> > > >>>>> - Can these APIs reliably address the extra use-case (e.g. allow the checkpointing interval to change dynamically even during the unbounded phase) as it claims?
> > > >>>>
> > > >>>> I don't see why not.
> > > >>>>
> > > >>>>> - Can these APIs be decoupled from the APIs currently proposed in FLIP-309?
> > > >>>>
> > > >>>> Yes
> > > >>>>
> > > >>>>> For example, if the APIs of this alternative approach can be decoupled from the APIs currently proposed in FLIP-309, then it might be reasonable to work on this extra use-case with a more advanced/complicated design separately in a followup work.
> > > >>>>
> > > >>>> As I voiced my concerns previously, the current design of FLIP-309 would clog the public API and in the long run confuse the users. IMO it's addressing the problem in the wrong place.
> > > >>>>
> > > >>>>> Hmm.. do you mean we can do the following:
> > > >>>>> - Have all source operators emit a metric named "processingBacklog".
> > > >>>>> - Add a job-level config that specifies "the checkpointing interval to be used when any source is processing backlog".
> > > >>>>> - The JM collects the "processingBacklog" periodically from all source operators and uses the newly added config value as appropriate.
> > > >>>>
> > > >>>> Yes.
> > > >>>>
> > > >>>>> The challenge with this approach is that we need to define the semantics of this "processingBacklog" metric and have all source operators implement this metric. I am not sure we are able to do this yet without having users explicitly provide this information on a per-source basis.
> > > >>>>>
> > > >>>>> Suppose the job reads from a bounded Kafka source, should it emit "processingBacklog=true"? If yes, then the job might use the long checkpointing interval even if the job is asked to process data starting from now to the next 1 hour. If no, then the job might use the short checkpointing interval even if the job is asked to re-process data starting from 7 days ago.
> > > >>>>
> > > >>>> Yes. The same can be said of your proposal. Your proposal has the very same issues: every source would have to implement it differently, most sources would have no idea how to properly calculate the new requested checkpoint interval, and for those that do know how to do that, the user would have to configure every source individually. Yet again we would end up with a system that works only partially in some special use cases (HybridSource), which confuses the users even more.
> > > >>>>
> > > >>>> That's why I think the more generic solution, working primarily on the same metrics that are used by various auto scaling solutions (like the Flink K8s operator's autoscaler), would be better. The hacky solution I proposed was meant to:
> > > >>>> 1. show you that the generic solution is simply a superset of your proposal
> > > >>>> 2. if you are adamant that busyness/backpressured/records processing rate/pending records metrics wouldn't cover your use case sufficiently (imo they can), then you can very easily enhance this algorithm with some hints from the sources, like "processingBacklog==true", to short circuit the main algorithm, if `processingBacklog` is available (see the sketch below).
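> > > >>>>
> > > >>>> A rough sketch of that short circuit (all of the names here are hypothetical, just to illustrate the idea):
> > > >>>>
> > > >>>> Duration nextCheckpointInterval() {
> > > >>>>     // Override: if any source explicitly reports that it is processing
> > > >>>>     // a backlog, go straight to the slow interval.
> > > >>>>     if (anySourceReportsProcessingBacklog()) {
> > > >>>>         return slowCheckpointInterval;
> > > >>>>     }
> > > >>>>     // Otherwise fall back to the generic algorithm based on the
> > > >>>>     // busyness/backpressure/records-rate/pending-records metrics.
> > > >>>>     return intervalFromGenericMetrics();
> > > >>>> }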
> > > >>>>
> > > >>>> Best,
> > > >>>> Piotrek
> > > >>>>
> > > >>>> On Fri, Jun 16, 2023 at 4:45 AM Dong Lin <lindon...@gmail.com> wrote:
> > > >>>>
> > > >>>>> Hi again Piotr,
> > > >>>>>
> > > >>>>> Thank you for the reply. Please see my reply inline.
> > > >>>>>
> > > >>>>> On Fri, Jun 16, 2023 at 12:11 AM Piotr Nowojski <piotr.nowoj...@gmail.com> wrote:
> > > >>>>>
> > > >>>>>> Hi again Dong,
> > > >>>>>>
> > > >>>>>>> I understand that JM will get the backpressure-related metrics every time the RestServerEndpoint receives the REST request to get these metrics. But I am not sure if RestServerEndpoint is already always receiving such REST requests at a regular interval (suppose there is no human manually opening/clicking the Flink Web UI). And if it does, what is the interval?
> > > >>>>>>
> > > >>>>>> Good catch, I thought that metrics are pre-emptively sent to the JM every 10 seconds. Indeed that's not the case at the moment, and that would have to be improved.
> > > >>>>>>
> > > >>>>>>> I would be surprised if Flink is already paying this much overhead just for metrics monitoring. That is the main reason I still doubt it is true. Can you show where this 100 ms is currently configured?
> > > >>>>>>>
> > > >>>>>>> Alternatively, maybe you mean that we should add extra code to invoke the REST API at a 100 ms interval. Then that means we need to considerably increase the network/cpu overhead at the JM, where the overhead will increase as the number of TMs/slots increases, which may pose a risk to the scalability of the proposed design. I am not sure we should do this. What do you think?
> > > >>>>>>
> > > >>>>>> Sorry, I didn't mean the metric should be reported every 100 ms. I meant that "backPressuredTimeMsPerSecond (metric) would report (a value of) 100ms/s." once per metric interval (10s?).
> > > >>>>>
> > > >>>>> Suppose there are 1000 subtasks and each subtask has a 1% chance of being "backpressured" at a given time (due to random traffic spikes). Then at any given time, the chance of the job being considered not-backpressured = (1-0.01)^1000. Since we evaluate the backpressure metric once a second, the estimated time for the job to be considered not-backpressured is roughly 1 / ((1-0.01)^1000) = 23163 sec = 6.4 hours.
> > > >>>>>
> > > >>>>> This means that the job will effectively always use the longer checkpointing interval. It looks like a real concern, right?
> > > >>>>>
> > > >>>>>>> - What is the interface of this CheckpointTrigger? For example, are we going to give CheckpointTrigger a context that it can use to fetch arbitrary metric values? This can help us understand what information this user-defined CheckpointTrigger can use to make the checkpoint decision.
> > > >>>>>>
> > > >>>>>> I honestly don't think this is important at this stage of the discussion. It could have whatever interface we would deem to be best. Required things:
> > > >>>>>>
> > > >>>>>> - access to at least a subset of metrics that the given `CheckpointTrigger` requests, for example via some registration mechanism, so we don't have to fetch all of the metrics all the time from the TMs.
> > > >>>>>> - some way to influence the `CheckpointCoordinator`, either via manually triggering checkpoints, and/or the ability to change the checkpointing interval.
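> > > >>>>>>
> > > >>>>>> For the sake of argument, a strawman of such an interface could look like the following (every name here is up for bikeshedding):
> > > >>>>>>
> > > >>>>>> /** Runs on the JM and is consulted by the CheckpointCoordinator. */
> > > >>>>>> public interface CheckpointTrigger {
> > > >>>>>>
> > > >>>>>>     /** Registers the metrics this trigger needs, so that we don't have
> > > >>>>>>      * to fetch all of the metrics from the TMs all the time. */
> > > >>>>>>     Set<String> requestedMetrics();
> > > >>>>>>
> > > >>>>>>     /** Called periodically with fresh values of the requested metrics;
> > > >>>>>>      * returns the checkpointing interval to use until the next call. */
> > > >>>>>>     Duration nextCheckpointInterval(Map<String, Double> metricValues);
> > > >>>>>> }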
> > > >>>>>
> > > >>>>> Hmm... I honestly think it will be useful to know the APIs for the following reasons.
> > > >>>>>
> > > >>>>> We would need to know the concrete APIs to gauge the following:
> > > >>>>> - For the use-case mentioned in the FLIP-309 motivation section, would the APIs of this alternative approach be more or less usable?
> > > >>>>> - Can these APIs reliably address the extra use-case (e.g. allow the checkpointing interval to change dynamically even during the unbounded phase) as it claims?
> > > >>>>> - Can these APIs be decoupled from the APIs currently proposed in FLIP-309?
> > > >>>>>
> > > >>>>> For example, if the APIs of this alternative approach can be decoupled from the APIs currently proposed in FLIP-309, then it might be reasonable to work on this extra use-case with a more advanced/complicated design separately in a followup work.
> > > >>>>>
> > > >>>>>>> - Where is this CheckpointTrigger running? For example, is it going to run on the subtask of every source operator? Or is it going to run on the JM?
> > > >>>>>>
> > > >>>>>> IMO on the JM.
> > > >>>>>>
> > > >>>>>>> - Are we going to provide a default implementation of this CheckpointTrigger in Flink that implements the algorithm described below, or do we expect each source operator developer to implement their own CheckpointTrigger?
> > > >>>>>>
> > > >>>>>> As I mentioned before, I think we should provide at the very least the implementation that replaces the current triggering mechanism (statically configured checkpointing interval), and it would be great to provide the backpressure monitoring trigger as well.
> > > >>>>>
> > > >>>>> I agree that if there is a good use-case that can be addressed by the proposed CheckpointTrigger, then it is reasonable to add CheckpointTrigger and replace the current triggering mechanism with it.
> > > >>>>>
> > > >>>>> I also agree that we will likely find such a use-case. For example, suppose the source records have event timestamps; then it is likely that we can use the trigger to dynamically control the checkpointing interval based on the difference between the watermark and the current system time.
> > > >>>>>
> > > >>>>> But I am not sure the addition of this CheckpointTrigger should be coupled with FLIP-309. Whether or not it is coupled probably depends on the concrete API design around CheckpointTrigger.
> > > >>>>>
> > > >>>>>> If you would be adamant that the backpressure monitoring doesn't cover your use case well enough, I would be ok to provide the hacky version that I also mentioned before:
> > > >>>>>>
> > > >>>>>> """
> > > >>>>>> Especially that if my proposed algorithm wouldn't work well enough, there is an obvious solution, that any source could add a metric, like, let's say, "processingBacklog: true/false", and the `CheckpointTrigger` could use this as an override to always switch to the "slowCheckpointInterval". I don't think we need it, but that's always an option that would be basically equivalent to your original proposal.
> > > >>>>>> """
> > > >>>>>
> > > >>>>> Hmm.. do you mean we can do the following:
> > > >>>>> - Have all source operators emit a metric named "processingBacklog".
> > > >>>>> - Add a job-level config that specifies "the checkpointing interval to be used when any source is processing backlog".
> > > >>>>> - The JM collects the "processingBacklog" periodically from all source operators and uses the newly added config value as appropriate.
> > > >>>>>
> > > >>>>> The challenge with this approach is that we need to define the semantics of this "processingBacklog" metric and have all source operators implement this metric. I am not sure we are able to do this yet without having users explicitly provide this information on a per-source basis.
> > > >>>>>
> > > >>>>> Suppose the job reads from a bounded Kafka source, should it emit "processingBacklog=true"? If yes, then the job might use the long checkpointing interval even if the job is asked to process data starting from now to the next 1 hour. If no, then the job might use the short checkpointing interval even if the job is asked to re-process data starting from 7 days ago.
> > > >>>>>
> > > >>>>>>> - How can users specify the fastCheckpointInterval/slowCheckpointInterval? For example, will we provide APIs on the CheckpointTrigger that end-users can use to specify the checkpointing interval? What would that look like?
> > > >>>>>>
> > > >>>>>> Also as I mentioned before, just like metric reporters are configured:
> > > >>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/metric_reporters/
> > > >>>>>> Every CheckpointTrigger could have its own custom configuration.
> > > >>>>>>
> > > >>>>>>> Overall, my gut feel is that the alternative approach based on CheckpointTrigger is more complicated
> > > >>>>>>
> > > >>>>>> Yes, as usual, more generic things are more complicated, but often more useful in the long run.
> > > >>>>>>
> > > >>>>>>> and harder to use.
> > > >>>>>>
> > > >>>>>> I don't agree. Why would setting, in the config,
> > > >>>>>>
> > > >>>>>> execution.checkpointing.trigger: BackPressureMonitoringCheckpointTrigger
> > > >>>>>> execution.checkpointing.BackPressureMonitoringCheckpointTrigger.fast-interval: 1s
> > > >>>>>> execution.checkpointing.BackPressureMonitoringCheckpointTrigger.slow-interval: 30s
> > > >>>>>>
> > > >>>>>> (note that we could even provide a shortcut to the above construct via:
> > > >>>>>>
> > > >>>>>> execution.checkpointing.fast-interval: 1s
> > > >>>>>> execution.checkpointing.slow-interval: 30s
> > > >>>>>>
> > > >>>>>> ) be harder compared to setting two/three checkpoint intervals: one in the config or via `env.enableCheckpointing(x)`, and secondly passing one/two (fast/slow) values on the source itself?
> > > >>>>>
> > > >>>>> If we can address the use-case by providing just the two job-level configs as described above, I agree it will indeed be simpler.
> > > >>>>>
> > > >>>>> I have tried to achieve this goal. But the caveat is that it requires much more work than described above in order to give the configs well-defined semantics. So I find it simpler to just use the approach in FLIP-309.
> > > >>>>>
> > > >>>>> Let me explain my concern below. It will be great if you or someone else can help provide a solution.
> > > >>>>>
> > > >>>>> 1) We need to clearly document when the fast-interval and slow-interval will be used, so that users can derive the expected behavior of the job and be able to configure these values.
> > > >>>>>
> > > >>>>> 2) The trigger of the fast/slow interval depends on the behavior of the source (e.g. MySQL CDC, HybridSource). However, no existing concept of the source operator (e.g. boundedness) can describe the target behavior. For example, MySQL CDC internally has two phases, namely the snapshot phase and the binlog phase, which are not explicitly exposed to its users via the source operator API. And we probably should not enumerate all internal phases of all source operators that are affected by the fast/slow interval.
> > > >>>>>
> > > >>>>> 3) An alternative approach might be to define a new concept (e.g. processingBacklog) that is applied to all source operators. Then the fast/slow interval's documentation can depend on this concept. That means we have to add a top-level concept (similar to source boundedness) and require all source operators to specify how they enforce this concept (e.g. FileSystemSource always emits processingBacklog=true). And there might be cases where the source itself (e.g. a bounded Kafka source) can not automatically derive the value of this concept, in which case we need to provide an option for users to explicitly specify the value for this concept on a per-source basis.
> > > >>>>>
> > > >>>>>>> And it probably also has the issues of "having two places to configure the checkpointing interval" and "giving flexibility for every source to implement a different API" (as mentioned below).
> > > >>>>>>
> > > >>>>>> No, it doesn't.
> > > >>>>>>
> > > >>>>>>> IMO, it is a hard requirement for the user-facing API to be clearly defined, and users should be able to use the API without concern of regression. And this requirement is more important than the other goals discussed above because it is related to the stability/performance of the production job. What do you think?
> > > >>>>>>
> > > >>>>>> I don't agree with this. There are many things that work somewhere in between perfectly and well enough in some fraction of use cases (maybe in 99%, maybe 95% or maybe 60%), while still being very useful. Good examples are: selection of the state backend, unaligned checkpoints, buffer debloating. Frankly, if I go through the list of currently available config options, something like half of them can cause regressions. Heck, even Flink itself doesn't work perfectly in 100% of the use cases, due to a variety of design choices. Of course, the more use cases are fine with said feature, the better, but we shouldn't fixate on perfectly covering 100% of the cases, as that's impossible.
> > > >>>>>>
> > > >>>>>> In this particular case, if the backpressure monitoring trigger can work well enough in 95% of cases, I would say that's already better than the originally proposed alternative, which doesn't work at all if the user has a large backlog to reprocess from Kafka, including when using HybridSource AFTER the switch to Kafka has happened. For the remaining 5%, we should try to improve the behaviour over time, but ultimately, users can decide to just run a fixed checkpoint interval (or at worst use the hacky checkpoint trigger that I mentioned before a couple of times).
> > > >>>>>>
> > > >>>>>> Also, to be pedantic, if a user naively sets the slow-interval in your proposal to 30 minutes, when that user's job fails on average every 15-20 minutes, their job can end up in a state where it can not make any progress, which arguably is quite a serious regression.
> > > >>>>>
> > > >>>>> I probably should not say it is a "hard requirement". After all, there are pros/cons. We will need to consider implementation complexity, usability, extensibility etc.
> > > >>>>>
> > > >>>>> I just don't think we should take it for granted to introduce regression for one use-case in order to support another use-case. If we can not find an algorithm/solution that addresses both use-cases well, I hope we can be open to tackling them separately so that users can choose the option that best fits their needs.
> > > >>>>>
> > > >>>>> All else being equal, I think it is preferable for the user-facing API to be clearly defined, letting users use the API without concern of regression.
> > > >>>>>
> > > >>>>> Maybe we can list the pros/cons of the alternative approaches we have been discussing and choose the best approach. And maybe we will end up finding that the use-case which needs CheckpointTrigger can be tackled separately from the use-case in FLIP-309.
> > > >>>>>
> > > >>>>>>> I am not sure if there is a typo. Because if backPressuredTimeMsPerSecond = 0, then maxRecordsConsumedWithoutBackpressure = numRecordsInPerSecond / 1000 * metricsUpdateInterval according to the above algorithm.
> > > >>>>>>>
> > > >>>>>>> Do you mean "maxRecordsConsumedWithoutBackpressure = (numRecordsInPerSecond / (1 - backPressuredTimeMsPerSecond / 1000)) * metricsUpdateInterval"?
> > > >>>>>>
> > > >>>>>> It looks like there is indeed some mistake in my proposal above. Yours looks more correct; it probably still needs some safeguard/special handling if `backPressuredTimeMsPerSecond > 950`.
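> > > >>>>>>
> > > >>>>>> For example something along these lines (just a sketch):
> > > >>>>>>
> > > >>>>>> // Cap the extrapolation so that we don't divide by ~0 when a subtask
> > > >>>>>> // is almost fully backpressured.
> > > >>>>>> double cappedBackPressuredTime = Math.min(backPressuredTimeMsPerSecond, 950);
> > > >>>>>> double maxRecordsConsumedWithoutBackpressure =
> > > >>>>>>     numRecordsInPerSecond / (1 - cappedBackPressuredTime / 1000) * metricsUpdateInterval;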
> > > >>>>>>
> > > >>>>>>> The only information it can access is the backlog.
> > > >>>>>>
> > > >>>>>> Again no. It can access whatever we want to provide to it.
> > > >>>>>>
> > > >>>>>> Regarding the rest of your concerns: it's a matter of tweaking the parameters and the algorithm itself, and how much of a safety-net we want to have. Ultimately, I'm pretty sure that's a (for 95-99% of cases) solvable problem. If not, there is always the hacky solution, which could even be integrated into the above mentioned algorithm as a short circuit to always reach the `slow-interval`.
> > > >>>>>>
> > > >>>>>> Apart from that, you picked 3 minutes as the checkpointing interval in your counter example. In most cases any interval above 1 minute would inflict pretty negligible overheads, so all in all, I would doubt there is a significant benefit (in most cases) of increasing a 3-minute checkpoint interval to anything more, let alone 30 minutes.
> > > >>>>>
> > > >>>>> I am not sure we should design the algorithm with the assumption that the short checkpointing interval will always be higher than 1 minute etc.
> > > >>>>>
> > > >>>>> I agree the proposed algorithm can solve most cases where the resources are sufficient and there is always no backlog in the source subtasks. On the other hand, what makes SRE life hard is probably the remaining 1-5% of cases where the traffic is spiky and the cluster is reaching its capacity limit. The ability to predict and control a Flink job's behavior (including the checkpointing interval) can considerably reduce the burden of managing Flink jobs.
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Dong
> > > >>>>>
> > > >>>>>> Best,
> > > >>>>>> Piotrek
> > > >>>>>>
> > > >>>>>> On Sat, Jun 3, 2023 at 5:44 AM Dong Lin <lindon...@gmail.com> wrote:
> > > >>>>>>
> > > >>>>>>> Hi Piotr,
> > > >>>>>>>
> > > >>>>>>> Thanks for the explanations. I have some followup questions below.
> > > >>>>>>>
> > > >>>>>>> On Fri, Jun 2, 2023 at 10:55 PM Piotr Nowojski <pnowoj...@apache.org> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi All,
> > > >>>>>>>>
> > > >>>>>>>> Thanks for chipping in the discussion Ahmed!
> > > >>>>>>>>
> > > >>>>>>>> Regarding using the REST API: currently I'm leaning towards implementing this feature inside Flink itself, via some pluggable interface. The REST API solution would be tempting, but I guess not everyone is using the Flink Kubernetes Operator.
> > > >>>>>>>>
> > > >>>>>>>> @Dong
> > > >>>>>>>>
> > > >>>>>>>>> I am not sure metrics such as isBackPressured are already sent to JM.
> > > >>>>>>>>
> > > >>>>>>>> Fetching code path on the JM:
> > > >>>>>>>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl#queryTmMetricsFuture
> > > >>>>>>>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore#add
> > > >>>>>>>>
> > > >>>>>>>> Example code path accessing Task level metrics via JM using the `MetricStore`:
> > > >>>>>>>> org.apache.flink.runtime.rest.handler.job.metrics.AggregatingSubtasksMetricsHandler
> > > >>>>>>>
> > > >>>>>>> Thanks for the code reference. I checked the code that invoked these two classes and found the following information:
> > > >>>>>>>
> > > >>>>>>> - AggregatingSubtasksMetricsHandler#getStores is currently invoked only when AggregatingJobsMetricsHandler is invoked.
> > > >>>>>>> - AggregatingJobsMetricsHandler is only instantiated and returned by WebMonitorEndpoint#initializeHandlers
> > > >>>>>>> - WebMonitorEndpoint#initializeHandlers is only used by RestServerEndpoint. And RestServerEndpoint invokes these handlers in response to external REST requests.
> > > >>>>>>>
> > > >>>>>>> I understand that JM will get the backpressure-related metrics every time the RestServerEndpoint receives the REST request to get these metrics. But I am not sure if RestServerEndpoint is already always receiving such REST requests at a regular interval (suppose there is no human manually opening/clicking the Flink Web UI). And if it does, what is the interval?
> > > >>>>>>>
> > > >>>>>>>>> For example, let's say every source operator subtask reports this metric to JM once every 10 seconds. There are 100 source subtasks. And each subtask is backpressured roughly 10% of the total time due to traffic spikes (and limited buffer). Then at any given time, there is a 1 - 0.9^100 = 99.997% chance that there is at least one subtask that is backpressured. Then we have to wait for at least 10 seconds to check again.
> > > >>>>>>>>
> > > >>>>>>>> backPressuredTimeMsPerSecond and other related metrics (like busyTimeMsPerSecond) are not subject to that problem. They are recalculated once every metric fetching interval, and they report accurately on average how long the given subtask spent busy/idling/backpressured. In your example, backPressuredTimeMsPerSecond would report 100ms/s.
> > > >>>>>>>
> > > >>>>>>> Suppose every subtask is already reporting backPressuredTimeMsPerSecond to JM once every 100 ms. If a job has 10 operators (that are not chained) and each operator has 100 subtasks, then the JM would need to handle 10000 requests per second to receive metrics from these 1000 subtasks. It seems like a non-trivial overhead for medium-to-large sized jobs and can make the JM the performance bottleneck during job execution.
> > > >>>>>>>
> > > >>>>>>> I would be surprised if Flink is already paying this much overhead just for metrics monitoring. That is the main reason I still doubt it is true. Can you show where this 100 ms is currently configured?
> > > >>>>>>>
> > > >>>>>>> Alternatively, maybe you mean that we should add extra code to invoke the REST API at a 100 ms interval. Then that means we need to considerably increase the network/cpu overhead at the JM, where the overhead will increase as the number of TMs/slots increases, which may pose a risk to the scalability of the proposed design. I am not sure we should do this. What do you think?
> > > >>>>>>>
> > > >>>>>>>>> While it will be nice to support additional use-cases with one proposal, it is probably also reasonable to make incremental progress and support the low-hanging-fruit use-case first. The choice really depends on the complexity and the importance of supporting the extra use-cases.
> > > >>>>>>>>
> > > >>>>>>>> That would be true if that were a private implementation detail, or if the low-hanging-fruit solution were on the direct path to the final solution. That's unfortunately not the case here. This will add a public facing API that we will later need to maintain, no matter what the final solution will be, and at the moment at least I don't see it being related to a "perfect" solution.
> > > >>>>>>>
> > > >>>>>>> Sure. Then let's decide the final solution first.
> > > >>>>>>>
> > > >>>>>>>>> I guess the point is that the suggested approach, which dynamically determines the checkpointing interval based on the backpressure, may cause regression when the checkpointing interval is relatively low. This makes it hard for users to enable this feature in production. It is like an auto-driving system that is not guaranteed to work
> > > >>>>>>>>
> > > >>>>>>>> Yes, creating a more generic solution that would require less configuration is usually more difficult than static configurations. It doesn't mean we shouldn't try to do that. Especially since, if my proposed algorithm doesn't work well enough, there is an obvious solution: any source could add a metric, like, let's say, "processingBacklog: true/false", and the `CheckpointTrigger` could use this as an override to always switch to the "slowCheckpointInterval". I don't think we need it, but that's always an option that would be basically equivalent to your original proposal.
> > > >>>>>>>> Or a source could even add "suggestedCheckpointInterval : int", and the `CheckpointTrigger` could use that value, if present, as a hint in one way or another.
> > > >>>>>>>
> > > >>>>>>> So far we have talked about the possibility of using CheckpointTrigger and mentioned that the CheckpointTrigger can read metric values.
> > > >>>>>>>
> > > >>>>>>> Can you help answer the following questions so that I can understand the alternative solution more concretely:
> > > >>>>>>>
> > > >>>>>>> - What is the interface of this CheckpointTrigger? For example, are we going to give CheckpointTrigger a context that it can use to fetch arbitrary metric values? This can help us understand what information this user-defined CheckpointTrigger can use to make the checkpoint decision.
> > > >>>>>>> - Where is this CheckpointTrigger running? For example, is it going to run on the subtask of every source operator? Or is it going to run on the JM?
> > > >>>>>>> - Are we going to provide a default implementation of this CheckpointTrigger in Flink that implements the algorithm described below, or do we expect each source operator developer to implement their own CheckpointTrigger?
> > > >>>>>>> - How can users specify the fastCheckpointInterval/slowCheckpointInterval? For example, will we provide APIs on the CheckpointTrigger that end-users can use to specify the checkpointing interval? What would that look like?
> > > >>>>>>>
> > > >>>>>>> Overall, my gut feel is that the alternative approach based on CheckpointTrigger is more complicated and harder to use. And it probably also has the issues of "having two places to configure the checkpointing interval" and "giving flexibility for every source to implement a different API" (as mentioned below).
> > > >>>>>>>
> > > >>>>>>> Maybe we can evaluate it more after knowing the answers to the above questions.
> > > >>>>>>>
> > > >>>>>>>>> On the other hand, the approach currently proposed in the FLIP is much simpler as it does not depend on backpressure. Users specify the extra interval requirement on the specific sources (e.g. HybridSource, MySQL CDC Source) and can easily know the checkpointing interval that will be used in the continuous phase of the corresponding source. This is pretty much the same as how users use the existing execution.checkpointing.interval config. So there is no extra concern of regression caused by this approach.
> > > >>>>>>>>
> > > >>>>>>>> To an extent, but as I have already previously mentioned, I really really do not like the idea of:
> > > >>>>>>>> - having two places to configure the checkpointing interval (the config file and the Source builders)
> > > >>>>>>>> - giving flexibility for every source to implement a different API for that purpose
> > > >>>>>>>> - creating a solution that is not generic enough, so that we will need a completely different mechanism in the future anyway
> > > >>>>>>>
> > > >>>>>>> Yeah, I understand different developers might have different concerns/tastes for these APIs. Ultimately, there might not be a perfect solution and we have to choose based on the pros/cons of these solutions.
> > > >>>>>>>
> > > >>>>>>> I agree with you that, all things being equal, it is preferable to 1) have one place to configure checkpointing intervals, 2) have all source operators use the same API, and 3) create a solution that is generic and long lasting. Note that these three goals affect the usability and extensibility of the API, but not necessarily the stability/performance of the production job.
> > > >>>>>>>
> > > >>>>>>> BTW, there are also other preferable goals. For example, it is very useful for the job's behavior to be predictable and interpretable so that SREs can operate/debug Flink more easily. We can list these pros/cons altogether later.
> > > >>>>>>>
> > > >>>>>>> I am wondering if we can first agree on the priority of the goals we want to achieve. IMO, it is a hard requirement for the user-facing API to be clearly defined, and users should be able to use the API without concern of regression. And this requirement is more important than the other goals discussed above because it is related to the stability/performance of the production job. What do you think?
> > > >>>>>>>
> > > >>>>>>>>> Sounds good. Looking forward to learning more ideas.
> > > >>>>>>>>
> > > >>>>>>>> I have thought about this a bit more, and I think we don't need to check for the backpressure status, or how overloaded all of the operators are. We could just check three things for source operators:
> > > >>>>>>>> 1. pendingRecords (backlog length)
> > > >>>>>>>> 2. numRecordsInPerSecond
> > > >>>>>>>> 3. backPressuredTimeMsPerSecond
> > > >>>>>>>>
> > > >>>>>>>> // int metricsUpdateInterval = 10s // obtained from config
> > > >>>>>>>> // Next line calculates how many records we can consume from the backlog, assuming
> > > >>>>>>>> // that magically the reason behind a backpressure vanishes. We will use this only as
> > > >>>>>>>> // a safeguard against scenarios like for example if backpressure was caused by some
> > > >>>>>>>> // intermittent failure/performance degradation.
> > > >>>>>>>> maxRecordsConsumedWithoutBackpressure = (numRecordsInPerSecond / (1000 - backPressuredTimeMsPerSecond / 1000)) * metricsUpdateInterval
> > > >>>>>>>
> > > >>>>>>> I am not sure if there is a typo. Because if backPressuredTimeMsPerSecond = 0, then maxRecordsConsumedWithoutBackpressure = numRecordsInPerSecond / 1000 * metricsUpdateInterval according to the above algorithm.
> > > >>>>>>>
> > > >>>>>>> Do you mean "maxRecordsConsumedWithoutBackpressure = (numRecordsInPerSecond / (1 - backPressuredTimeMsPerSecond / 1000)) * metricsUpdateInterval"?
> > > >>>>>>>
> > > >>>>>>>> // we are excluding maxRecordsConsumedWithoutBackpressure from the backlog as
> > > >>>>>>>> // a safeguard against intermittent back pressure problems, so that we don't
> > > >>>>>>>> // calculate the next checkpoint interval far, far in the future, in case the
> > > >>>>>>>> // backpressure goes away before we recalculate the metrics and the new
> > > >>>>>>>> // checkpointing interval
> > > >>>>>>>> timeToConsumeBacklog = (pendingRecords - maxRecordsConsumedWithoutBackpressure) / numRecordsInPerSecond
> > > >>>>>>>>
> > > >>>>>>>> Then we can use those numbers to calculate the desired checkpoint interval, for example like this:
> > > >>>>>>>>
> > > >>>>>>>> long calculatedCheckpointInterval = timeToConsumeBacklog / 10; // this may need some refining
> > > >>>>>>>> long nextCheckpointInterval = min(max(fastCheckpointInterval, calculatedCheckpointInterval), slowCheckpointInterval);
> > > >>>>>>>> long nextCheckpointTs = lastCheckpointTs + nextCheckpointInterval;
> > > >>>>>>>>
> > > >>>>>>>> WDYT?
> > > >>>>>>>
> > > >>>>>>> I think the idea of the above algorithm is to lean towards using the fastCheckpointInterval unless we are very sure the backlog will take a long time to process. This can alleviate the concern of regression during the continuous-unbounded phase since we are more likely to use the fastCheckpointInterval. However, it can cause regression during the bounded phase.
> > > >>>>>>>
> > > >>>>>>> I will use a concrete example to explain the risk of regression:
> > > >>>>>>> - The user is using HybridSource to read from HDFS followed by Kafka. The data in HDFS is old and there is no need for data freshness for the data in HDFS.
> > > >>>>>>> - The user configures the job as below:
> > > >>>>>>>   - fastCheckpointInterval = 3 minutes
> > > >>>>>>>   - slowCheckpointInterval = 30 minutes
> > > >>>>>>>   - metricsUpdateInterval = 100 ms
> > > >>>>>>>
> > > >>>>>>> Using the above formula, we know that once pendingRecords <= numRecordsInPerSecond * 30 minutes, then calculatedCheckpointInterval <= 3 minutes, meaning that we will use fastCheckpointInterval as the checkpointing interval. Then in the last 30 minutes of the bounded phase, the checkpointing frequency will be 10X higher than what the user wants.
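> > > >>>>>>>
> > > >>>>>>> To make the arithmetic explicit, suppose (purely for illustration) numRecordsInPerSecond = 10,000 and the remaining HDFS backlog is pendingRecords = 18,000,000, i.e. 30 minutes' worth of data:
> > > >>>>>>>
> > > >>>>>>> timeToConsumeBacklog ≈ 18,000,000 / 10,000 = 1800 sec = 30 min
> > > >>>>>>> calculatedCheckpointInterval = 1800 sec / 10 = 180 sec = 3 min
> > > >>>>>>> nextCheckpointInterval = min(max(3 min, 3 min), 30 min) = 3 min
> > > >>>>>>>
> > > >>>>>>> So throughout the final 30 minutes of the HDFS phase the job checkpoints every 3 minutes, even though the user wants a 30-minute interval there.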
> > > >>>>>>>
> > > >>>>>>> Also note that the same issue would considerably limit the benefits of the algorithm. For example, during the continuous phase, the algorithm will only be better than the approach in FLIP-309 when there is at least 30 minutes' worth of backlog in the source.
> > > >>>>>>>
> > > >>>>>>> Sure, having a slower checkpointing interval in this extreme case (where there is a 30-minute backlog in the continuous-unbounded phase) is still useful when this happens. But since this is the un-common case, and the right solution is probably to do capacity planning to avoid this from happening in the first place, I am not sure it is worth optimizing for this case at the cost of regression in the bounded phase and the reduced operational predictability for users (e.g. what checkpointing interval should I expect at this stage of the job).
> > > >>>>>>>
> > > >>>>>>> I think the fundamental issue with this algorithm is that it is applied to both the bounded phases and the continuous-unbounded phases without knowing which phase the job is running in. The only information it can access is the backlog. But two sources with the same amount of backlog do not necessarily have the same data freshness requirement.
> > > >>>>>>>
> > > >>>>>>> In this particular example, users know that the data in HDFS is very old and there is no need for data freshness. Users can express such signals via the per-source API proposed in the FLIP. This is why the current approach in FLIP-309 can be better in this case.
> > > >>>>>>>
> > > >>>>>>> What do you think?
> > > >>>>>>>
> > > >>>>>>> Best,
> > > >>>>>>> Dong
> > > >>>>>>>
> > > >>>>>>>> Best,
> > > >>>>>>>> Piotrek