Hi Eno,

The state is on local disk, so having the checkpoint in a topic won't help. If the host fails permanently, then all of the local state is gone. Starting on another host requires restoring from the earliest offset.

Thanks,
Damian

On Mon, 6 Feb 2017 at 14:58 Eno Thereska <eno.there...@gmail.com> wrote:

> Hi Damian,
>
> I am trying to figure out if this handles a common enough failure
> scenario. It seems to me this handles transient failures: a server with an
> instance fails, then comes back up shortly and the instance recovers
> quickly by reading the checkpoint file.
>
> Permanent failures, where the server fails and the instance is migrated
> onto another server, are not helped since the checkpoint file is lost with
> the server down. Even if the server eventually comes up (transient failure,
> but instance has migrated), the instance would have migrated to another
> server, and it doesn't help that we have a checkpoint file locally.
>
> I was thinking a topic-based implementation would handle all scenarios.
> What am I missing? I'm basically worried that the file-based implementation
> addresses a niche problem, but can be convinced otherwise.
>
> Thanks
> Eno
>
> > On 3 Feb 2017, at 10:03, Damian Guy <damian....@gmail.com> wrote:
> >
> > Hi Matthias,
> >
> > It possibly doesn't make sense to disable it, but then I'm sure someone
> > will come up with a reason they don't want it!
> > I'm happy to change it such that the checkpoint interval must be > 0.
> >
> > Cheers,
> > Damian
> >
> > On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <matth...@confluent.io> wrote:
> >
> >> Thanks Damian.
> >>
> >> One more question: "Checkpointing is disabled if the checkpoint interval
> >> is set to a value <=0."
> >>
> >> Does it make sense to disable checkpointing? What's the tradeoff here?
> >>
> >> -Matthias
> >>
> >> On 2/2/17 1:51 AM, Damian Guy wrote:
> >>> Hi Matthias,
> >>>
> >>> Thanks for the comments.
> >>>
> >>> 1. TBD - I need to do some performance tests and try and work out a
> >>> sensible default.
> >>> 2. Yes, you are correct. It could be a multiple of the commit.interval.ms.
> >>> But, that would also mean if you change the commit interval - say you lower
> >>> it, then you might also need to change the checkpoint setting (i.e., you
> >>> still only want to checkpoint every n minutes).
> >>>
> >>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <matth...@confluent.io> wrote:
> >>>
> >>>> Thanks for the KIP Damian.
> >>>>
> >>>> I am wondering about two things:
> >>>>
> >>>> 1. what should be the default value for the new parameter?
> >>>> 2. why is the new parameter provided in ms?
> >>>>
> >>>> About (2): because
> >>>>
> >>>> "the minimum checkpoint interval will be the value of
> >>>> commit.interval.ms. In effect the actual checkpoint interval will be a
> >>>> multiple of the commit interval"
> >>>>
> >>>> it might be easier to just use a parameter that is "number-of-commit
> >>>> intervals".
> >>>>
> >>>> -Matthias
> >>>>
> >>>> On 2/1/17 7:29 AM, Damian Guy wrote:
> >>>>> Thanks for the comments Eno.
> >>>>> As for exactly once, I don't believe this matters as we are just restoring
> >>>>> the change-log, i.e., the result of the aggregations that previously ran
> >>>>> etc. So once initialized the state store will be in the same state as it
> >>>>> was before.
> >>>>> Having the checkpoint in a Kafka topic is not ideal as the state is per
> >>>>> Kafka Streams instance. So each instance would need to start with a unique
> >>>>> id that is persistent.
> >>>>>
> >>>>> Cheers,
> >>>>> Damian
> >>>>>
> >>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <eno.there...@gmail.com> wrote:
> >>>>>
> >>>>>> As a follow up to my previous comment, have you thought about writing the
> >>>>>> checkpoint to a topic instead of a local file? That would have the
> >>>>>> advantage that all metadata continues to be managed by Kafka, as well as
> >>>>>> fit with EoS. The potential disadvantage would be higher latency; however,
> >>>>>> if it is periodic as you mention, I'm not sure that would be a show stopper.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Eno
> >>>>>>
> >>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <eno.there...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Thanks Damian, this is a good idea and will reduce the restore time.
> >>>>>>> Looking forward, with exactly once and support for transactions in Kafka, I
> >>>>>>> believe we'll have to add some support for rolling back checkpoints, e.g.,
> >>>>>>> when a transaction is aborted. We need to be aware of that and ideally
> >>>>>>> anticipate those needs a bit in the KIP.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Eno
> >>>>>>>
> >>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <damian....@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> I would like to start the discussion on KIP-116:
> >>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-116+-+Add+State+Store+Checkpoint+Interval+Configuration
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Damian
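
For reference, the interval arithmetic discussed in this thread can be sketched as below. This is only an illustration in Python, not Kafka Streams code. It assumes checkpoints are written only at commit time, once at least the configured checkpoint interval has elapsed since the last one, which is how the quoted KIP text ("the actual checkpoint interval will be a multiple of the commit interval") reads. Both function names are hypothetical.

```python
def effective_checkpoint_interval_ms(commit_interval_ms, checkpoint_interval_ms):
    """Hypothetical helper: checkpoint interval actually observed when
    checkpoints can only be written at commit time (assumption)."""
    if commit_interval_ms <= 0:
        raise ValueError("commit_interval_ms must be positive")
    if checkpoint_interval_ms <= 0:
        # Per the KIP text quoted in the thread: <= 0 disables checkpointing.
        return None
    # Round up to the next whole number of commits (integer ceiling division).
    commits_per_checkpoint = -(-checkpoint_interval_ms // commit_interval_ms)
    return commits_per_checkpoint * commit_interval_ms


def interval_from_commit_count(commit_interval_ms, commits_per_checkpoint):
    """Hypothetical alternative: the setting expressed as a number of
    commit intervals, as suggested in the thread."""
    return commit_interval_ms * commits_per_checkpoint
```

With a millisecond setting of 60000, halving the commit interval from 30000 to 15000 leaves the effective checkpoint interval at 60000. With a setting of "2 commit intervals", the same change halves it from 60000 to 30000, which is the coupling Damian points out.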