> Starting on another host requires restoring from the earliest offset.

Btw, there's a special scenario where a full restore is not required: when
the local storage (volume) is being re-used, e.g. when a container uses a
storage mount that will be re-used by a new container in case the original
one becomes unavailable.
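
To make the distinction concrete, here is a rough sketch in Python (not the
actual Streams code) of how the restore start point could be chosen. The file
name and the single-integer file format are made up for illustration, and
`earliest_offset` stands in for whatever the changelog topic reports as its
earliest offset. If the mounted state directory still holds a readable
checkpoint, restoration resumes from it; on a fresh host it falls back to the
earliest offset.

```python
from pathlib import Path
from typing import Optional

# Hypothetical file name; assumed to contain a single integer offset.
CHECKPOINT_FILE = "checkpoint"


def read_checkpoint(state_dir: Path) -> Optional[int]:
    """Return the checkpointed changelog offset, or None if there is none."""
    path = state_dir / CHECKPOINT_FILE
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        # Missing file (fresh host) or unreadable content: no usable checkpoint.
        return None


def restore_start_offset(state_dir: Path, earliest_offset: int) -> int:
    """Choose the changelog offset at which restoration should begin."""
    checkpoint = read_checkpoint(state_dir)
    return earliest_offset if checkpoint is None else checkpoint
```

With a re-used volume the second container sees the same `state_dir` as the
first, so it takes the checkpoint branch and skips the full restore.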

On Mon, Feb 6, 2017 at 4:01 PM, Damian Guy <damian....@gmail.com> wrote:

> Hi Eno,
>
> The state is on local disk, so having the checkpoint in a topic won't help.
> If the host fails permanently, then all of the local state is gone.
> Starting on another host requires restoring from the earliest offset.
>
> Thanks,
> Damian
>
> On Mon, 6 Feb 2017 at 14:58 Eno Thereska <eno.there...@gmail.com> wrote:
>
> > Hi Damian,
> >
> > I am trying to figure out if this handles a common enough failure
> > scenario. It seems to me this handles transient failures: a server with an
> > instance fails, then comes back up shortly and the instance recovers
> > quickly by reading the checkpoint file.
> >
> > Permanent failures, where the server fails and the instance is migrated
> > onto another server are not helped since the checkpoint file is lost with
> > the server down. Even if the server eventually comes up (transient failure,
> > but instance has migrated), the instance would have migrated to another
> > server, and it doesn't help that we have a checkpoint file locally.
> >
> > I was thinking a topic-based implementation would handle all scenarios.
> > What am I missing? I'm basically worried that the file-based implementation
> > addresses a niche problem, but can be convinced otherwise.
> >
> > Thanks
> > Eno
> >
> >
> >
> > > On 3 Feb 2017, at 10:03, Damian Guy <damian....@gmail.com> wrote:
> > >
> > > Hi Matthias,
> > >
> > > It possibly doesn't make sense to disable it, but then i'm sure someone
> > > will come up with a reason they don't want it!
> > > I'm happy to change it such that the checkpoint interval must be > 0.
> > >
> > > Cheers,
> > > Damian
> > >
> > > On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <matth...@confluent.io> wrote:
> > >
> > >> Thanks Damian.
> > >>
> > >> One more question: "Checkpointing is disabled if the checkpoint interval
> > >> is set to a value <=0."
> > >>
> > >>
> > >> Does it make sense to disable checkpointing? What's the tradeoff here?
> > >>
> > >>
> > >> -Matthias
> > >>
> > >>
> > >> On 2/2/17 1:51 AM, Damian Guy wrote:
> > >>> Hi Matthias,
> > >>>
> > >>> Thanks for the comments.
> > >>>
> > >>> 1. TBD - i need to do some performance tests and try and work out a
> > >>> sensible default.
> > >>> 2. Yes, you are correct. It could be a multiple of the commit.interval.ms.
> > >>> But, that would also mean if you change the commit interval - say you lower
> > >>> it, then you might also need to change the checkpoint setting (i.e., you
> > >>> still only want to checkpoint every n minutes).
> > >>>
> > >>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <matth...@confluent.io> wrote:
> > >>>
> > >>>> Thanks for the KIP Damian.
> > >>>>
> > >>>> I am wondering about two things:
> > >>>>
> > >>>> 1. what should be the default value for the new parameter?
> > >>>> 2. why is the new parameter provided in ms?
> > >>>>
> > >>>> About (2): because
> > >>>>
> > >>>> "the minimum checkpoint interval will be the value of
> > >>>> commit.interval.ms. In effect the actual checkpoint interval will be a
> > >>>> multiple of the commit interval"
> > >>>>
> > >>>> it might be easier to just use a parameter that is "number-of-commit
> > >>>> intervals".
> > >>>>
> > >>>>
> > >>>> -Matthias
> > >>>>
> > >>>>
> > >>>> On 2/1/17 7:29 AM, Damian Guy wrote:
> > >>>>> Thanks for the comments Eno.
> > >>>>> As for exactly once, i don't believe this matters as we are just restoring
> > >>>>> the change-log, i.e, the result of the aggregations that previously ran
> > >>>>> etc. So once initialized the state store will be in the same state as it
> > >>>>> was before.
> > >>>>> Having the checkpoint in a kafka topic is not ideal as the state is per
> > >>>>> kafka streams instance. So each instance would need to start with a unique
> > >>>>> id that is persistent.
> > >>>>>
> > >>>>> Cheers,
> > >>>>> Damian
> > >>>>>
> > >>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <eno.there...@gmail.com> wrote:
> > >>>>>
> > >>>>>> As a follow up to my previous comment, have you thought about writing the
> > >>>>>> checkpoint to a topic instead of a local file? That would have the
> > >>>>>> advantage that all metadata continues to be managed by Kafka, as well as
> > >>>>>> fit with EoS. The potential disadvantage would be higher latency; however,
> > >>>>>> if it is periodic as you mention, I'm not sure that would be a show stopper.
> > >>>>>>
> > >>>>>> Thanks
> > >>>>>> Eno
> > >>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <eno.there...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>> Thanks Damian, this is a good idea and will reduce the restore time.
> > >>>>>>> Looking forward, with exactly once and support for transactions in Kafka, I
> > >>>>>>> believe we'll have to add some support for rolling back checkpoints, e.g.,
> > >>>>>>> when a transaction is aborted. We need to be aware of that and ideally
> > >>>>>>> anticipate a bit those needs in the KIP.
> > >>>>>>>
> > >>>>>>> Thanks
> > >>>>>>> Eno
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <damian....@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi all,
> > >>>>>>>>
> > >>>>>>>> I would like to start the discussion on KIP-116:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-116+-+Add+State+Store+Checkpoint+Interval+Configuration
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Damian
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> > >>
> >
> >
>
