> Starting on another host requires restoring from the earliest offset.

Btw, there's a special scenario where a full restore is not required: when the local storage (volume) is re-used, e.g. when a container uses a storage mount that a new container will re-use if the original one becomes unavailable.
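To make the re-used volume case concrete, here is a minimal sketch, assuming the replacement container starts with the same `application.id` and points `state.dir` at the same mount. Both are standard Kafka Streams settings; the class name, the application id, and the path `/mnt/streams-state` are made up for illustration, and plain `java.util.Properties` is used so the snippet has no Kafka dependency.

```java
import java.util.Properties;

public class ReusedStateDirSketch {

    // Builds Kafka Streams-style settings as plain properties.
    // Both keys are real Kafka Streams config names; the values passed in
    // by the caller are hypothetical examples.
    static Properties buildConfig(String applicationId, String mountedStateDir) {
        Properties props = new Properties();
        // Must match the id used by the original container.
        props.setProperty("application.id", applicationId);
        // Points at the storage mount that outlives the container.
        props.setProperty("state.dir", mountedStateDir);
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig("my-streams-app", "/mnt/streams-state");
        System.out.println(props.getProperty("state.dir"));
    }
}
```

If the new container comes up with these same two values, it should find the existing local state on the mount rather than an empty directory.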
On Mon, Feb 6, 2017 at 4:01 PM, Damian Guy <damian....@gmail.com> wrote:

> Hi Eno,
>
> The state is on local disk, so having the checkpoint in a topic won't help.
> If the host fails permanently, then all of the local state is gone.
> Starting on another host requires restoring from the earliest offset.
>
> Thanks,
> Damian
>
> On Mon, 6 Feb 2017 at 14:58 Eno Thereska <eno.there...@gmail.com> wrote:
>
> > Hi Damian,
> >
> > I am trying to figure out if this handles a common enough failure
> > scenario. It seems to me this handles transient failures: a server with
> > an instance fails, then comes back up shortly and the instance recovers
> > quickly by reading the checkpoint file.
> >
> > Permanent failures, where the server fails and the instance is migrated
> > onto another server are not helped since the checkpoint file is lost with
> > the server down. Even if the server eventually comes up (transient
> > failure, but instance has migrated), the instance would have migrated to
> > another server, and it doesn't help that we have a checkpoint file
> > locally.
> >
> > I was thinking a topic-based implementation would handle all scenarios.
> > What am I missing? I'm basically worried that the file-based
> > implementation addresses a niche problem, but can be convinced otherwise.
> >
> > Thanks
> > Eno
> >
> > > On 3 Feb 2017, at 10:03, Damian Guy <damian....@gmail.com> wrote:
> > >
> > > Hi Matthias,
> > >
> > > It possibly doesn't make sense to disable it, but then I'm sure someone
> > > will come up with a reason they don't want it!
> > > I'm happy to change it such that the checkpoint interval must be > 0.
> > >
> > > Cheers,
> > > Damian
> > >
> > > On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <matth...@confluent.io> wrote:
> > >
> > >> Thanks Damian.
> > >>
> > >> One more question: "Checkpointing is disabled if the checkpoint
> > >> interval is set to a value <=0."
> > >>
> > >> Does it make sense to disable checkpointing? What's the tradeoff here?
> > >>
> > >> -Matthias
> > >>
> > >> On 2/2/17 1:51 AM, Damian Guy wrote:
> > >>> Hi Matthias,
> > >>>
> > >>> Thanks for the comments.
> > >>>
> > >>> 1. TBD - I need to do some performance tests and try and work out a
> > >>> sensible default.
> > >>> 2. Yes, you are correct. It could be a multiple of the
> > >>> commit.interval.ms. But, that would also mean if you change the
> > >>> commit interval - say you lower it, then you might also need to
> > >>> change the checkpoint setting (i.e., you still only want to
> > >>> checkpoint every n minutes).
> > >>>
> > >>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <matth...@confluent.io> wrote:
> > >>>
> > >>>> Thanks for the KIP Damian.
> > >>>>
> > >>>> I am wondering about two things:
> > >>>>
> > >>>> 1. what should be the default value for the new parameter?
> > >>>> 2. why is the new parameter provided in ms?
> > >>>>
> > >>>> About (2): because
> > >>>>
> > >>>> "the minimum checkpoint interval will be the value of
> > >>>> commit.interval.ms. In effect the actual checkpoint interval will
> > >>>> be a multiple of the commit interval"
> > >>>>
> > >>>> it might be easier to just use a parameter that is
> > >>>> "number-of-commit-intervals".
> > >>>>
> > >>>> -Matthias
> > >>>>
> > >>>> On 2/1/17 7:29 AM, Damian Guy wrote:
> > >>>>> Thanks for the comments Eno.
> > >>>>> As for exactly once, I don't believe this matters as we are just
> > >>>>> restoring the change-log, i.e., the result of the aggregations
> > >>>>> that previously ran etc. So once initialized the state store will
> > >>>>> be in the same state as it was before.
> > >>>>> Having the checkpoint in a Kafka topic is not ideal as the state
> > >>>>> is per Kafka Streams instance. So each instance would need to
> > >>>>> start with a unique id that is persistent.
> > >>>>>
> > >>>>> Cheers,
> > >>>>> Damian
> > >>>>>
> > >>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <eno.there...@gmail.com> wrote:
> > >>>>>
> > >>>>>> As a follow up to my previous comment, have you thought about
> > >>>>>> writing the checkpoint to a topic instead of a local file? That
> > >>>>>> would have the advantage that all metadata continues to be
> > >>>>>> managed by Kafka, as well as fit with EoS. The potential
> > >>>>>> disadvantage would be a slower latency, however if it is
> > >>>>>> periodic as you mention, I'm not sure that would be a show
> > >>>>>> stopper.
> > >>>>>>
> > >>>>>> Thanks
> > >>>>>> Eno
> > >>>>>>
> > >>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <eno.there...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>> Thanks Damian, this is a good idea and will reduce the restore
> > >>>>>>> time. Looking forward, with exactly once and support for
> > >>>>>>> transactions in Kafka, I believe we'll have to add some
> > >>>>>>> support for rolling back checkpoints, e.g., when a transaction
> > >>>>>>> is aborted. We need to be aware of that and ideally anticipate
> > >>>>>>> a bit those needs in the KIP.
> > >>>>>>>
> > >>>>>>> Thanks
> > >>>>>>> Eno
> > >>>>>>>
> > >>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <damian....@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi all,
> > >>>>>>>>
> > >>>>>>>> I would like to start the discussion on KIP-116:
> > >>>>>>>>
> > >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-116+-+Add+State+Store+Checkpoint+Interval+Configuration
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Damian