Hi Eno,

The state is on local disk, so having the checkpoint in a topic won't help.
If the host fails permanently, then all of the local state is gone.
Starting on another host requires restoring from the earliest offset.
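
To make the difference concrete, here is a rough sketch of how the restore
start point could be chosen. It is illustrative only and is not the Streams
implementation: the function names and the idea of passing the checkpointed
offset as an optional value are assumptions made for this example.

```python
from typing import Optional


def restore_start_offset(checkpoint_offset: Optional[int],
                         earliest_offset: int) -> int:
    """Pick the changelog offset a state store restore would start from.

    checkpoint_offset is None when no local checkpoint file exists,
    e.g. because the instance was started on a fresh host.
    """
    if checkpoint_offset is None:
        # No local state to build on: replay the changelog from the start.
        return earliest_offset
    # Local state is valid up to the checkpoint; never go below the
    # earliest offset still retained in the changelog.
    return max(checkpoint_offset, earliest_offset)


def records_to_replay(checkpoint_offset: Optional[int],
                      earliest_offset: int,
                      end_offset: int) -> int:
    """Number of changelog records that must be re-read on restore."""
    start = restore_start_offset(checkpoint_offset, earliest_offset)
    return max(end_offset - start, 0)
```

With a local checkpoint only the tail of the changelog is replayed; on a new
host there is no local checkpoint, so the whole changelog is replayed
regardless of where the checkpoint offset might be stored.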

Thanks,
Damian

On Mon, 6 Feb 2017 at 14:58 Eno Thereska <eno.there...@gmail.com> wrote:

> Hi Damian,
>
> I am trying to figure out if this handles a common enough failure
> scenario. It seems to me this handles transient failures: a server with an
> instance fails, then comes back up shortly and the instance recovers
> quickly by reading the checkpoint file.
>
> Permanent failures, where the server fails and the instance is migrated
> onto another server, are not helped, since the checkpoint file is lost with
> the server. Even if the server eventually comes back up, the instance will
> already have migrated to another server, so the local checkpoint file
> doesn't help.
>
> I was thinking a topic-based implementation would handle all scenarios.
> What am I missing? I'm basically worried that the file-based implementation
> addresses a niche problem, but can be convinced otherwise.
>
> Thanks
> Eno
>
>
>
> > On 3 Feb 2017, at 10:03, Damian Guy <damian....@gmail.com> wrote:
> >
> > Hi Matthias,
> >
> > It possibly doesn't make sense to disable it, but then I'm sure someone
> > will come up with a reason they don't want it!
> > I'm happy to change it such that the checkpoint interval must be > 0.
> >
> > Cheers,
> > Damian
> >
> > On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <matth...@confluent.io> wrote:
> >
> >> Thanks Damian.
> >>
> >> One more question: "Checkpointing is disabled if the checkpoint interval
> >> is set to a value <=0."
> >>
> >>
> >> Does it make sense to disable checkpointing? What's the tradeoff here?
> >>
> >>
> >> -Matthias
> >>
> >>
> >> On 2/2/17 1:51 AM, Damian Guy wrote:
> >>> Hi Matthias,
> >>>
> >>> Thanks for the comments.
> >>>
> >>> 1. TBD - I need to do some performance tests and try to work out a
> >>> sensible default.
> >>> 2. Yes, you are correct. It could be a multiple of the
> >>> commit.interval.ms. But that would also mean that if you change the
> >>> commit interval (say you lower it), you might also need to change the
> >>> checkpoint setting (i.e., you still only want to checkpoint every n
> >>> minutes).
> >>>
> >>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <matth...@confluent.io> wrote:
> >>>
> >>>> Thanks for the KIP Damian.
> >>>>
> >>>> I am wondering about two things:
> >>>>
> >>>> 1. what should be the default value for the new parameter?
> >>>> 2. why is the new parameter provided in ms?
> >>>>
> >>>> About (2): because
> >>>>
> >>>> "the minimum checkpoint interval will be the value of
> >>>> commit.interval.ms. In effect the actual checkpoint interval will be a
> >>>> multiple of the commit interval"
> >>>>
> >>>> it might be easier to just use a parameter that is the "number of
> >>>> commit intervals".
> >>>>
> >>>>
> >>>> -Matthias
> >>>>
> >>>>
> >>>> On 2/1/17 7:29 AM, Damian Guy wrote:
> >>>>> Thanks for the comments, Eno.
> >>>>> As for exactly once, I don't believe this matters, as we are just
> >>>>> restoring the change-log, i.e., the result of the aggregations that
> >>>>> previously ran etc. So once initialized, the state store will be in
> >>>>> the same state as it was before.
> >>>>> Having the checkpoint in a Kafka topic is not ideal, as the state is
> >>>>> per Kafka Streams instance. So each instance would need to start with
> >>>>> a unique id that is persistent.
> >>>>>
> >>>>> Cheers,
> >>>>> Damian
> >>>>>
> >>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <eno.there...@gmail.com> wrote:
> >>>>>
> >>>>>> As a follow-up to my previous comment, have you thought about
> >>>>>> writing the checkpoint to a topic instead of a local file? That
> >>>>>> would have the advantage that all metadata continues to be managed
> >>>>>> by Kafka, as well as fitting with EoS. The potential disadvantage
> >>>>>> would be higher latency; however, if it is periodic as you mention,
> >>>>>> I'm not sure that would be a show stopper.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Eno
> >>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <eno.there...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Thanks Damian, this is a good idea and will reduce the restore
> >>>>>>> time. Looking forward, with exactly once and support for
> >>>>>>> transactions in Kafka, I believe we'll have to add some support
> >>>>>>> for rolling back checkpoints, e.g., when a transaction is aborted.
> >>>>>>> We need to be aware of that and ideally anticipate those needs a
> >>>>>>> bit in the KIP.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Eno
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <damian....@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> I would like to start the discussion on KIP-116:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-116+-+Add+State+Store+Checkpoint+Interval+Configuration
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Damian
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
>
>
