Re: [DISCUSS] KIP-116 - Add State Store Checkpoint Interval Configuration

Damian Guy Fri, 10 Feb 2017 12:22:26 -0800

I'm fine with that. Guozhang?
On Fri, 10 Feb 2017 at 19:45, Matthias J. Sax <matth...@confluent.io> wrote:


> I am actually supporting Eno's view: checkpoint on every commit.
>
> @Dhwani: I understand your view and did raise the same question about
> performance trade-off with checkpointing enabled/disabled etc. However,
> it seems that writing the checkpoint file is super cheap -- thus, there
> is nothing to gain performance wise by disabling it.
>
> For Streams EoS we do not need the checkpoint file -- but we should have
> a switch for EoS anyway and can disable the checkpoint file for this
> case. And even if there is no switch and we enable EoS all the time, we
> can get rid of the checkpoint file overall (making the parameter obsolete).
>
> IMHO, if the config parameter is not really useful, we should not have it.
>
>
> -Matthias
>
>
> On 2/10/17 9:27 AM, Damian Guy wrote:
> > Guozhang, thanks for the clarification. Understood.
> >
> > Eno, you are correct if we just used commit interval then we wouldn't
> need
> > a KIP. But, then we'd have no way of turning it off.
> >
> > On Fri, 10 Feb 2017 at 17:14 Eno Thereska <eno.there...@gmail.com>
> wrote:
> >
> >> A quick check: the checkpoint file is not new, we're just exposing a
> knob
> >> on when to set it, right? Would turning it off still do what it does
> today
> >> (i.e., write the checkpoint at the end when the user quits?) So it's
> not a
> >> new feature as such, I was only recommending we dial up the frequency by
> >> default. With that option arguably we don't even need a KIP.
> >>
> >> Eno
> >>
> >>
> >>
> >>> On 10 Feb 2017, at 17:02, Guozhang Wang <wangg...@gmail.com> wrote:
> >>>
> >>> Damian,
> >>>
> >>> I was wondering whether it is a new failure scenario, but as Eno pointed out
> it
> >>> was not.
> >>>
> >>> Another thing I was considering is if it has any impact for
> incorporating
> >>> KIP-98 to avoid duplicates: if there is a failure in the middle of a
> >>> transaction, then upon recovery we cannot rely on the local state store
> >>> file even if the checkpoint file exists, since the local state store
> file
> >>> may not be at the transaction boundaries. But since Streams will likely
> >>> have EOS as an opt-in, I think it is still worthwhile to add this
> feature,
> >>> just keeping in mind that when EOS is turned on it may cease to be
> >>> effective.
> >>>
> >>> And yes, I'd suggest we leave the config value to be possibly
> >> non-positive
> >>> to indicate not turning on this feature for the reason above: if it
> will
> >>> not be effective then we want to leave it as an option to be turned
> off.
> >>>
> >>> Guozhang
> >>>
> >>>
> >>> On Fri, Feb 10, 2017 at 8:06 AM, Eno Thereska <eno.there...@gmail.com>
> >>> wrote:
> >>>
> >>>> The overhead of writing to the checkpoint file should be much, much
> >>>> smaller than the overall overhead of doing a commit, so I think tuning
> >> the
> >>>> commit time is sufficient to guide performance tradeoffs.
> >>>>
> >>>> Eno
> >>>>
> >>>>> On 10 Feb 2017, at 13:08, Dhwani Katagade <
> >> dhwani_katag...@persistent.co
> >>>> .in> wrote:
> >>>>>
> >>>>> Maybe for fine-tuning the performance.
> >>>>> Say we don't need the checkpointing and would like to gain the lil
> bit
> >>>> of performance improvement by turning it off.
> >>>>> The trade off is between giving people control knobs vs complicating
> >> the
> >>>> complete set of knobs.
> >>>>>
> >>>>> -dk
> >>>>>
> >>>>> On Friday 10 February 2017 04:05 PM, Eno Thereska wrote:
> >>>>>> I can't see why users would care to turn it off.
> >>>>>>
> >>>>>> Eno
> >>>>>>> On 10 Feb 2017, at 10:29, Damian Guy <damian....@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Eno,
> >>>>>>>
> >>>>>>> Sounds good to me. The only reason i can think of is if we want to
> be
> >>>> able
> >>>>>>> to turn it off.
> >>>>>>> Guozhang - thoughts?
> >>>>>>>
> >>>>>>> On Fri, 10 Feb 2017 at 10:28 Eno Thereska <eno.there...@gmail.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> Question: if checkpointing is so cheap why not do it every commit
> >>>>>>>> interval? That way we can get rid of this extra config variable
> and
> >>>> just
> >>>>>>>> use the existing commit interval.
> >>>>>>>>
> >>>>>>>> Less tuning knobs.
> >>>>>>>>
> >>>>>>>> Eno
> >>>>>>>>
> >>>>>>>>> On 10 Feb 2017, at 09:27, Damian Guy <damian....@gmail.com>
> wrote:
> >>>>>>>>>
> >>>>>>>>> Guozhang,
> >>>>>>>>>
> >>>>>>>>> You've confused me. The failure scenarios you have described are
> >> the
> >>>> same
> >>>>>>>>> as they are today. With the checkpoint files in place less data
> >> will
> >>>> be
> >>>>>>>>> replayed, so there will be fewer duplicates.
> >>>>>>>>>
> >>>>>>>>> Are you saying you'd like the option to turn checkpointing off?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Damian
> >>>>>>>>>
> >>>>>>>>> On Thu, 9 Feb 2017 at 21:55 Guozhang Wang <wangg...@gmail.com>
> >>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Eno,
> >>>>>>>>>>
> >>>>>>>>>> You are right, it is not a new scenario.
> >>>>>>>>>>
> >>>>>>>>>> Thinking a bit more on how we could incorporate KIP-98 in
> >> Streams, I
> >>>>>>>> feel
> >>>>>>>>>> that if EOS is turned on inside Streams, then we probably cannot
> >>>> always
> >>>>>>>>>> resume from the checkpointed offsets as it is not guaranteed to
> be
> >>>>>>>>>> "consistent"; but since EOS may not be turned on by default this
> >> is
> >>>>>>>> still
> >>>>>>>>>> worthwhile to add this feature I guess.
> >>>>>>>>>>
> >>>>>>>>>> About the default config values: I think the default value of 5
> >> min
> >>>> is
> >>>>>>>> OK
> >>>>>>>>>> to me, since restoration is usually faster than normal
> processing
> >>>>>>>> (unless
> >>>>>>>>>> your traffic was really high), about allowing it to be "turned
> >> off"
> >>>>>>>> with a
> >>>>>>>>>> non-positive value: I feel there are still values to keep this
> >> door
> >>>>>>>> open as
> >>>>>>>>>> in the future if EOS is turned on, people may just want to turn
> >> off
> >>>>>>>>>> checkpointing anyways, or there may be other scenarios that we
> have
> >>>> not
> >>>>>>>>>> realized yet. On the other hand, I would argue that it is less
> >>>> likely
> >>>>>>>> users
> >>>>>>>>>> mistakenly set it to a non-positive value.
> >>>>>>>>>>
> >>>>>>>>>> Guozhang
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Feb 9, 2017 at 1:03 PM, Eno Thereska <
> >>>> eno.there...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Guozhang,
> >>>>>>>>>>>
> >>>>>>>>>>> It seems to me we have the same semantics today. Are you saying
> >>>> there
> >>>>>>>> is
> >>>>>>>>>> a
> >>>>>>>>>>> new failure scenario?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Eno
> >>>>>>>>>>>
> >>>>>>>>>>>> On 9 Feb 2017, at 19:42, Guozhang Wang <wangg...@gmail.com>
> >>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> More specifically, here is my reasoning of failure cases, and
> >>>> would
> >>>>>>>>>> like
> >>>>>>>>>>> to
> >>>>>>>>>>>> get your feedback:
> >>>>>>>>>>>>
> >>>>>>>>>>>> *StreamTask*
> >>>>>>>>>>>>
> >>>>>>>>>>>> For stream-task, the committing order is 1) flush state (may
> >> send
> >>>> more
> >>>>>>>>>>>> records to changelog in producer), 2) flush producer, 3)
> commit
> >>>>>>>>>> upstream
> >>>>>>>>>>>> offsets. My understanding is that the writing of the checkpoint file will
> >>>>>>>>>>>> be between 2) and 3), so that the new order will be 1) flush state,
> >>>>>>>>>>>> 2) flush producer, 3) write checkpoint file (when necessary), 4) commit
> >>>>>>>>>>>> upstream offsets.
> >>>>>>>>>>>>
> >>>>>>>>>>>> And we have a bunch of "changelog offsets" regarding the
> state:
> >> a)
> >>>>>>>>>> offset
> >>>>>>>>>>>> corresponding to the image of the persistent file, name it
> point
> >>>> A, b)
> >>>>>>>>>>> log
> >>>>>>>>>>>> end offset, name it offset B, c) checkpoint file recorded
> >> offset,
> >>>> name
> >>>>>>>>>> it
> >>>>>>>>>>>> offset C, d) offset corresponding to the current committed
> >>>> upstream
> >>>>>>>>>>> offset,
> >>>>>>>>>>>> name it offset D.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Now let's talk about the failure cases:
> >>>>>>>>>>>>
> >>>>>>>>>>>> If there is a crash between 1) and 2), then A > B = C = D. In
> >> this
> >>>>>>>>>> case,
> >>>>>>>>>>> if
> >>>>>>>>>>>> we restore, we will replay no logs at all since B = C while
> the
> >>>>>>>>>>> persistent
> >>>>>>>>>>>> state file is actually "ahead of time", and we will start reprocessing
> >>>>>>>>>>>> from the input offset corresponding to D = B < A and hence have some
> >>>>>>>>>>>> duplicates, *which will be incorrect* if the update logic involves reading
> >>>>>>>>>>>> the state store values as well (i.e. not a blind write), e.g. aggregations.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If there is a crash between 2) and 3), then A = B > C = D.
> When
> >> we
> >>>>>>>>>>> restore,
> >>>>>>>>>>>> we will replay from C -> B = A, and then start reprocessing
> from
> >>>> input
> >>>>>>>>>>>> offset corresponding to D < A, and same issue applies as
> above.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If there is a crash between 3) and 4), then A = B = C > D.
> When
> >> we
> >>>>>>>>>>> restore,
> >>>>>>>>>>>> we will not replay, and then start reprocessing from input
> >> offset
> >>>>>>>>>>>> corresponding to D < A, and same issue applies as above.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> *StandbyTask*
> >>>>>>>>>>>>
> >>>>>>>>>>>> We only do one operation today, which is 1) flush state, I
> think
> >>>> we
> >>>>>>>>>> will
> >>>>>>>>>>>> add the writing of the checkpoint file after it as step 2).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Failure cases again: offset A -> correspond to the image of
> the
> >>>> file,
> >>>>>>>>>>>> offset B -> changelog end offset, offset C -> written as in
> the
> >>>>>>>>>>> checkpoint
> >>>>>>>>>>>> file.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If there is a crash between 1) and 2), then B >= A > C (B >= A
> >>>> because
> >>>>>>>>>> we
> >>>>>>>>>>>> are reading from changelog topic so A will never be greater
> than
> >>>> B),
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1) and if this task resumes as a standby task, we will resume
> >>>>>>>>>> restoration
> >>>>>>>>>>>> from offset C, and a few duplicates from C -> A will be
> applied
> >>>> again
> >>>>>>>>>> to
> >>>>>>>>>>>> local state files, then continue from A -> B, *this is OK*
> since
> >>>> they
> >>>>>>>>>> do
> >>>>>>>>>>>> not incur any computations hence no side effects and are all
> >>>>>>>>>> idempotent.
> >>>>>>>>>>>> 2) and if this task resumes as a stream task, we will replay
> >>>>>>>> changelogs
> >>>>>>>>>>>> from C -> A, with duplicated updates, and then from A -> B.
> This
> >>>> is
> >>>>>>>>>> also
> >>>>>>>>>>> OK
> >>>>>>>>>>>> for the same reason as above.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> So it seems to me that this is not safe for a StreamTask, or
> >>>> maybe the
> >>>>>>>>>>>> writing of the checkpoint file in your mind is different?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Feb 9, 2017 at 11:02 AM, Guozhang Wang <
> >>>> wangg...@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>> A quick question re: `We will add the above config parameter
> to
> >>>>>>>>>>>>> *StreamsConfig*. During *StreamTask#commit()*,
> >>>>>>>> *StandbyTask#commit()*,
> >>>>>>>>>>>>> and *GlobalUpdateStateTask#flushState()* we will check if the
> >>>>>>>>>>> checkpoint
> >>>>>>>>>>>>> interval has elapsed and write the checkpoint file.`
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Will the writing of the checkpoint file happen before the
> >>>> flushing of
> >>>>>>>>>>> the
> >>>>>>>>>>>>> state manager?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Feb 9, 2017 at 10:48 AM, Matthias J. Sax <
> >>>>>>>>>> matth...@confluent.io
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> But 5 min means that we (in the worst case) need to replay
> >> data
> >>>> from
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> last 5 minutes to get the store ready.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So why not go with the min possible value of 30 seconds to
> >>>> speed up
> >>>>>>>>>>> this
> >>>>>>>>>>>>>> process if the impact is negligible anyway?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> What do you gain by being conservative?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 2/9/17 2:54 AM, Damian Guy wrote:
> >>>>>>>>>>>>>>> Why shouldn't it be 5 minutes? ;-)
> >>>>>>>>>>>>>>> It is a finger in the air number. Based on the testing i
> did
> >> it
> >>>>>>>>>> shows
> >>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>> there isn't much, if any, overhead when checkpointing a
> >> single
> >>>>>>>> store
> >>>>>>>>>>> on
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> commit interval. The default commit interval is 30 seconds,
> >> so
> >>>> it
> >>>>>>>>>>> could
> >>>>>>>>>>>>>>> possibly be set to that. However, i'd prefer to be a little
> >>>>>>>>>>>>>> conservative so
> >>>>>>>>>>>>>>> 5 minutes seemed reasonable.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, 9 Feb 2017 at 10:25 Michael Noll <
> >> mich...@confluent.io
> >>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>> Damian,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> could you elaborate briefly why the default value should
> be
> >> 5
> >>>>>>>>>>> minutes?
> >>>>>>>>>>>>>>>> What are the considerations, assumptions, etc. that go
> into
> >>>>>>>> picking
> >>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>> value?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Right now, in the KIP and in this discussion, "5 mins"
> looks
> >>>> like
> >>>>>>>> a
> >>>>>>>>>>>>>> magic
> >>>>>>>>>>>>>>>> number to me. :-)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -Michael
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 11:03 AM, Damian Guy <
> >>>> damian....@gmail.com
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>> I've run the SimpleBenchmark with checkpoint on and off
> to
> >>>> see
> >>>>>>>>>> what
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> impact is. It appears that there is very little impact,
> if
> >>>> any.
> >>>>>>>>>> The
> >>>>>>>>>>>>>>>> numbers
> >>>>>>>>>>>>>>>>> with checkpointing on actually look better, but that is
> >>>> likely
> >>>>>>>>>>> largely
> >>>>>>>>>>>>>>>> due
> >>>>>>>>>>>>>>>>> to external influences.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> In any case, i'm going to suggest we go with a default
> >>>> checkpoint
> >>>>>>>>>>>>>>>> interval
> >>>>>>>>>>>>>>>>> of 5 minutes. I've updated the KIP with this.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> commit every 10 seconds (no checkpoint)
> >>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec
> >>>>>>>> source+store]:
> >>>>>>>>>>>>>>>>> 10000000/34798/287372.83751939767/29.570664980746017
> >>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec
> >>>>>>>> source+store]:
> >>>>>>>>>>>>>>>>> 10000000/35942/278226.0308274442/28.62945857214401
> >>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec
> >>>>>>>> source+store]:
> >>>>>>>>>>>>>>>>> 10000000/34677/288375.58035585546/29.673847218617528
> >>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec
> >>>>>>>> source+store]:
> >>>>>>>>>>>>>>>>> 10000000/34677/288375.58035585546/29.673847218617528
> >>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec
> >>>>>>>> source+store]:
> >>>>>>>>>>>>>>>>> 10000000/31192/320595.02436522185/32.98922800718133
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> checkpoint every 10 seconds (same as commit interval)
> >>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec
> >>>>>>>> source+store]:
> >>>>>>>>>>>>>>>>> 10000000/36997/270292.185852907/27.81306592426413
> >>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec
> >>>>>>>> source+store]:
> >>>>>>>>>>>>>>>>> 10000000/32087/311652.69423754164/32.069062237043035
> >>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec
> >>>>>>>> source+store]:
> >>>>>>>>>>>>>>>>> 10000000/32895/303997.5680194558/31.281349749202004
> >>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec
> >>>>>>>> source+store]:
> >>>>>>>>>>>>>>>>> 10000000/33476/298721.4720994145/30.738439479029754
> >>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec
> >>>>>>>> source+store]:
> >>>>>>>>>>>>>>>>> 10000000/33196/301241.1133871551/30.99771056753826
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, 8 Feb 2017 at 09:02 Damian Guy <
> >> damian....@gmail.com
> >>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>> Matthias,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Fair point. I'll update the KIP.
> >>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Wed, 8 Feb 2017 at 05:49 Matthias J. Sax <
> >>>>>>>>>> matth...@confluent.io
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>> Damian,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I am not strict about it either. However, if there is no
> >>>>>>>>>> advantage
> >>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>> disabling it, we might not want to allow it. This would have the
> >>>>>>>>>>>>>>>>>> advantage of guarding users against accidentally switching it off.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 2/3/17 2:03 AM, Damian Guy wrote:
> >>>>>>>>>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It possibly doesn't make sense to disable it, but then
> >> i'm
> >>>> sure
> >>>>>>>>>>>>>>>> someone
> >>>>>>>>>>>>>>>>>>> will come up with a reason they don't want it!
> >>>>>>>>>>>>>>>>>>> I'm happy to change it such that the checkpoint
> interval
> >>>> must
> >>>>>>>>>> be >
> >>>>>>>>>>>>>> 0.
> >>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>> Damian
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <
> >>>>>>>>>>> matth...@confluent.io>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>> Thanks Damian.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> One more question: "Checkpointing is disabled if the
> >>>>>>>> checkpoint
> >>>>>>>>>>>>>>>>> interval
> >>>>>>>>>>>>>>>>>>>> is set to a value <=0."
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Does it make sense to disable check pointing? What's
> the
> >>>>>>>>>> tradeoff
> >>>>>>>>>>>>>>>>> here?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On 2/2/17 1:51 AM, Damian Guy wrote:
> >>>>>>>>>>>>>>>>>>>>> Hi Matthias,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for the comments.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1. TBD - i need to do some performance tests and try
> >> and
> >>>> work
> >>>>>>>>>>> out
> >>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>> sensible default.
> >>>>>>>>>>>>>>>>>>>>> 2. Yes, you are correct. It could be a multiple of
> the
> >>>>>>>>>>>>>>>>>>>> commit.interval.ms.
> >>>>>>>>>>>>>>>>>>>>> But, that would also mean if you change the commit
> >>>> interval -
> >>>>>>>>>>> say
> >>>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>>>> lower
> >>>>>>>>>>>>>>>>>>>>> it, then you might also need to change the checkpoint
> >>>> setting
> >>>>>>>>>>>>>> (i.e,
> >>>>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>>>>>> still only want to checkpoint every n minutes).
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <
> >>>>>>>>>>>>>> matth...@confluent.io
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP Damian.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I am wondering about two things:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> 1. what should be the default value for the new
> >>>> parameter?
> >>>>>>>>>>>>>>>>>>>>>> 2. why is the new parameter provided in ms?
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> About (2): because
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> "the minimum checkpoint interval will be the value
> of
> >>>>>>>>>>>>>>>>>>>>>> commit.interval.ms. In effect the actual checkpoint
> >>>>>>>> interval
> >>>>>>>>>>>>>> will
> >>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>> multiple of the commit interval"
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> it might be easier to just use a parameter that is
> >>>>>>>>>>>>>>>>>>>>>> "number-of-commit intervals".
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On 2/1/17 7:29 AM, Damian Guy wrote:
> >>>>>>>>>>>>>>>>>>>>>>> Thanks for the comments Eno.
> >>>>>>>>>>>>>>>>>>>>>>> As for exactly once, i don't believe this matters
> as
> >>>> we are
> >>>>>>>>>>> just
> >>>>>>>>>>>>>>>>>>>>>> restoring
> >>>>>>>>>>>>>>>>>>>>>>> the change-log, i.e, the result of the aggregations
> >>>> that
> >>>>>>>>>>>>>>>> previously
> >>>>>>>>>>>>>>>>>> ran
> >>>>>>>>>>>>>>>>>>>>>>> etc. So once initialized the state store will be in
> >> the
> >>>>>>>> same
> >>>>>>>>>>>>>>>> state
> >>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>>>>>>> was before.
> >>>>>>>>>>>>>>>>>>>>>>> Having the checkpoint in a kafka topic is not ideal
> >> as
> >>>> the
> >>>>>>>>>>> state
> >>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> per
> >>>>>>>>>>>>>>>>>>>>>>> kafka streams instance. So each instance would need
> >> to
> >>>>>>>> start
> >>>>>>>>>>>>>>>> with a
> >>>>>>>>>>>>>>>>>>>>>> unique
> >>>>>>>>>>>>>>>>>>>>>>> id that is persistent.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>>>> Damian
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <
> >>>>>>>>>>>>>> eno.there...@gmail.com
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>> As a follow up to my previous comment, have you
> >>>> thought
> >>>>>>>>>> about
> >>>>>>>>>>>>>>>>>> writing
> >>>>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> checkpoint to a topic instead of a local file?
> That
> >>>> would
> >>>>>>>>>>> have
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>>>>>> advantage that all metadata continues to be
> managed
> >> by
> >>>>>>>>>> Kafka,
> >>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>> well
> >>>>>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>>>>>> fit with EoS. The potential disadvantage would be higher latency; however,
> >>>>>>>>>>>>>>>>>>>>>>>> if it is periodic as you mention, I'm not sure
> that
> >>>> would
> >>>>>>>>>> be
> >>>>>>>>>>> a
> >>>>>>>>>>>>>>>>> show
> >>>>>>>>>>>>>>>>>>>>>> stopper.
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>>>>>>> Eno
> >>>>>>>>>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <
> >>>>>>>>>>> eno.there...@gmail.com
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks Damian, this is a good idea and will
> reduce
> >>>> the
> >>>>>>>>>>> restore
> >>>>>>>>>>>>>>>>>> time.
> >>>>>>>>>>>>>>>>>>>>>>>> Looking forward, with exactly once and support for
> >>>>>>>>>>> transactions
> >>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>> Kafka, I
> >>>>>>>>>>>>>>>>>>>>>>>> believe we'll have to add some support for rolling
> >>>> back
> >>>>>>>>>>>>>>>>> checkpoints,
> >>>>>>>>>>>>>>>>>>>>>> e.g.,
> >>>>>>>>>>>>>>>>>>>>>>>> when a transaction is aborted. We need to be aware
> >> of
> >>>> that
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> ideally
> >>>>>>>>>>>>>>>>>>>>>>>> anticipate a bit those needs in the KIP.
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>>>>>>>> Eno
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <
> >>>>>>>>>> damian....@gmail.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> I would like to start the discussion on KIP-116:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >>>>>>>>>>>>>>>>> 116+-+Add+State+Store+Checkpoint+Interval+Configuration
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Damian
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> -- Guozhang
> >>>>>>>>>>
> >>>>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> -- Guozhang
> >>
> >>
> >
>
>
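
The StreamTask failure cases Guozhang walks through in the quoted thread can be replayed with a small simulation. This is only a sketch under stated assumptions: the step names, the Offsets class, and the helpers crash_after and restore_plan are hypothetical names made up for illustration and are not Kafka Streams APIs. Offsets A to D follow the definitions in the thread, and the commit order is the proposed one (flush state, flush producer, write checkpoint file, commit upstream offsets).

```python
from dataclasses import dataclass

# Commit steps in the order proposed in the thread (names are illustrative).
STEPS = ("flush_state", "flush_producer", "write_checkpoint", "commit_offsets")


@dataclass
class Offsets:
    a: int  # changelog offset reflected in the local state store file
    b: int  # changelog log end offset
    c: int  # offset recorded in the checkpoint file
    d: int  # changelog offset matching the committed upstream offset


def crash_after(completed_steps: int, old: int, new: int) -> Offsets:
    """Offsets left behind if the task dies after `completed_steps` steps."""
    o = Offsets(old, old, old, old)
    for step in STEPS[:completed_steps]:
        if step == "flush_state":
            o.a = new
        elif step == "flush_producer":
            o.b = new
        elif step == "write_checkpoint":
            o.c = new
        else:  # commit_offsets
            o.d = new
    return o


def restore_plan(o: Offsets) -> dict:
    """What a restart would do with the offsets in `o`."""
    return {
        # Restoration replays the changelog from C up to B.
        "changelog_records_replayed": max(o.b - o.c, 0),
        # If the store is ahead of the committed input position,
        # reprocessing applies some updates a second time.
        "state_ahead_of_input": o.a > o.d,
    }


for n in range(len(STEPS) + 1):
    offsets = crash_after(n, old=100, new=110)
    print(n, offsets, restore_plan(offsets))
```

Running it with a crash after one, two, or three completed steps reproduces the three orderings from the thread (A > B = C = D, A = B > C = D, A = B = C > D). In each of them the store is ahead of the committed input offset, which is the duplicate-update concern raised for non-idempotent logic such as aggregations, and which, as Eno and Damian point out, already exists without the checkpoint file.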

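The relationship between the two intervals discussed in the quoted thread (the checkpoint is only considered during a commit, so the actual checkpoint interval ends up being a multiple of the commit interval, and a non-positive value disables it) can be sketched as follows. The function name is hypothetical, and the sketch assumes commits fire at perfectly regular intervals, which a real application would only approximate.

```python
from typing import Optional


def effective_checkpoint_interval_ms(commit_interval_ms: int,
                                     checkpoint_interval_ms: int) -> Optional[int]:
    """Return the interval at which checkpoints would actually be written.

    Returns None when checkpointing is disabled (non-positive setting).
    """
    if commit_interval_ms <= 0:
        raise ValueError("commit interval must be positive")
    if checkpoint_interval_ms <= 0:
        return None  # the "turned off" case discussed in the thread
    # The checkpoint is only checked at commit time, so round the configured
    # interval up to the next whole number of commits (at least one).
    commits = (checkpoint_interval_ms + commit_interval_ms - 1) // commit_interval_ms
    return commits * commit_interval_ms


# Default commit interval of 30 seconds with the proposed 5 minute default.
print(effective_checkpoint_interval_ms(30_000, 300_000))
# A checkpoint setting that is not a multiple of the commit interval.
print(effective_checkpoint_interval_ms(30_000, 45_000))
```

Under these assumptions a checkpoint setting smaller than the commit interval behaves the same as checkpointing on every commit, which is the simplification Eno and Matthias argue for.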