Hi Damian,

I am trying to figure out whether this handles a common enough failure scenario. It seems to me this handles transient failures: a server with an instance fails, then comes back up shortly afterwards, and the instance recovers quickly by reading the checkpoint file.
Permanent failures, where the server fails and the instance is migrated onto another server, are not helped, since the checkpoint file is lost along with the server. Even if the server eventually comes back up (a transient failure, but one where the instance has already migrated), the local checkpoint file doesn't help, because the instance is now running on another server.

I was thinking a topic-based implementation would handle all scenarios. What am I missing? I'm basically worried that the file-based implementation addresses a niche problem, but I can be convinced otherwise.

Thanks
Eno

> On 3 Feb 2017, at 10:03, Damian Guy <damian....@gmail.com> wrote:
>
> Hi Matthias,
>
> It possibly doesn't make sense to disable it, but then I'm sure someone
> will come up with a reason they don't want it!
> I'm happy to change it such that the checkpoint interval must be > 0.
>
> Cheers,
> Damian
>
> On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <matth...@confluent.io> wrote:
>
>> Thanks Damian.
>>
>> One more question: "Checkpointing is disabled if the checkpoint interval
>> is set to a value <=0."
>>
>> Does it make sense to disable checkpointing? What's the tradeoff here?
>>
>> -Matthias
>>
>>
>> On 2/2/17 1:51 AM, Damian Guy wrote:
>>> Hi Matthias,
>>>
>>> Thanks for the comments.
>>>
>>> 1. TBD - I need to do some performance tests and try and work out a
>>> sensible default.
>>> 2. Yes, you are correct. It could be a multiple of the commit.interval.ms.
>>> But, that would also mean if you change the commit interval - say you lower
>>> it, then you might also need to change the checkpoint setting (i.e., you
>>> still only want to checkpoint every n minutes).
>>>
>>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <matth...@confluent.io> wrote:
>>>
>>>> Thanks for the KIP Damian.
>>>>
>>>> I am wondering about two things:
>>>>
>>>> 1. what should be the default value for the new parameter?
>>>> 2. why is the new parameter provided in ms?
>>>>
>>>> About (2): because
>>>>
>>>> "the minimum checkpoint interval will be the value of
>>>> commit.interval.ms. In effect the actual checkpoint interval will be a
>>>> multiple of the commit interval"
>>>>
>>>> it might be easier to just use a parameter that is "number-of-commit
>>>> intervals".
>>>>
>>>> -Matthias
>>>>
>>>>
>>>> On 2/1/17 7:29 AM, Damian Guy wrote:
>>>>> Thanks for the comments Eno.
>>>>> As for exactly once, I don't believe this matters as we are just restoring
>>>>> the change-log, i.e., the result of the aggregations that previously ran
>>>>> etc. So once initialized the state store will be in the same state as it
>>>>> was before.
>>>>> Having the checkpoint in a Kafka topic is not ideal as the state is per
>>>>> Kafka Streams instance. So each instance would need to start with a unique
>>>>> id that is persistent.
>>>>>
>>>>> Cheers,
>>>>> Damian
>>>>>
>>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>
>>>>>> As a follow up to my previous comment, have you thought about writing the
>>>>>> checkpoint to a topic instead of a local file? That would have the
>>>>>> advantage that all metadata continues to be managed by Kafka, as well as
>>>>>> fit with EoS. The potential disadvantage would be higher latency; however,
>>>>>> if it is periodic as you mention, I'm not sure that would be a show stopper.
>>>>>>
>>>>>> Thanks
>>>>>> Eno
>>>>>>
>>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks Damian, this is a good idea and will reduce the restore time.
>>>>>>> Looking forward, with exactly once and support for transactions in Kafka, I
>>>>>>> believe we'll have to add some support for rolling back checkpoints, e.g.,
>>>>>>> when a transaction is aborted. We need to be aware of that and ideally
>>>>>>> anticipate those needs a bit in the KIP.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Eno
>>>>>>>
>>>>>>>
>>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <damian....@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I would like to start the discussion on KIP-116:
>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-116+-+Add+State+Store+Checkpoint+Interval+Configuration
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Damian
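
For reference, the KIP text quoted in the thread says the actual checkpoint interval ends up being a multiple of commit.interval.ms. The following is a minimal sketch of that arithmetic, not the KIP's implementation. It assumes a checkpoint can only be written as part of a commit, and that one is written at the first commit at or after the configured interval has elapsed. The function name is hypothetical and is not a Kafka Streams API.

```python
from typing import Optional


def effective_checkpoint_interval_ms(commit_interval_ms: int,
                                     checkpoint_interval_ms: int) -> Optional[int]:
    """Hypothetical helper (not part of Kafka Streams).

    Assumption: checkpoints are only written during a commit, so the
    configured interval is rounded up to a whole number of commit intervals.
    """
    if commit_interval_ms <= 0:
        raise ValueError("commit_interval_ms must be positive")
    # Per the KIP text quoted in the thread: a value <= 0 disables checkpointing.
    if checkpoint_interval_ms <= 0:
        return None
    # Integer ceiling division; at least one commit happens per checkpoint.
    commits_per_checkpoint = max(
        1, (checkpoint_interval_ms + commit_interval_ms - 1) // commit_interval_ms
    )
    return commits_per_checkpoint * commit_interval_ms


if __name__ == "__main__":
    # Commit every 30 s, ask for a checkpoint every 100 s.
    print(effective_checkpoint_interval_ms(30_000, 100_000))
    # Checkpoint interval shorter than the commit interval.
    print(effective_checkpoint_interval_ms(30_000, 10_000))
```

Under these assumptions, a setting given in milliseconds keeps roughly the same wall-clock spacing when the commit interval is lowered, whereas a "number of commit intervals" setting would shrink with it. That is the tradeoff Damian raises in his reply to point 2.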