Re: [DISCUSS] KIP-116 - Add State Store Checkpoint Interval Configuration

Eno Thereska Tue, 14 Feb 2017 04:16:16 -0800

Even if users commit on every record, the expensive part will not be the 
checkpointing proposed in this KIP, but the rest of the commit.
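
For reference, the trade-off discussed in the quoted thread below (checkpoint on every commit versus on a separate interval) can be modelled in a few lines. This is only a minimal sketch in Python, not Kafka Streams code: `CheckpointingTask`, `checkpoint_interval_ms`, and `simulate` are hypothetical names, and the rule that a non-positive interval disables checkpointing is taken from the KIP wording quoted further down.

```python
from dataclasses import dataclass


@dataclass
class CheckpointingTask:
    """Toy model of a task that may write a checkpoint during commit."""
    checkpoint_interval_ms: int   # hypothetical config; <= 0 disables checkpointing
    last_checkpoint_ms: int = 0
    checkpoints_written: int = 0

    def should_checkpoint(self, now_ms: int) -> bool:
        # A non-positive interval means "never checkpoint during commit".
        if self.checkpoint_interval_ms <= 0:
            return False
        return now_ms - self.last_checkpoint_ms >= self.checkpoint_interval_ms

    def commit(self, now_ms: int) -> None:
        # (flush state and flush producer would happen here)
        if self.should_checkpoint(now_ms):
            self.checkpoints_written += 1
            self.last_checkpoint_ms = now_ms
        # (commit upstream offsets would happen here)


def simulate(commit_interval_ms: int, checkpoint_interval_ms: int, duration_ms: int) -> int:
    """Commit at every commit interval and return how many checkpoints were written."""
    task = CheckpointingTask(checkpoint_interval_ms)
    for now_ms in range(commit_interval_ms, duration_ms + 1, commit_interval_ms):
        task.commit(now_ms)
    return task.checkpoints_written
```

Because the check only runs inside a commit, the effective checkpoint interval is a multiple of the commit interval. With a 30 second commit interval over 10 minutes, a 5 minute checkpoint interval writes 2 checkpoints, and an interval equal to the commit interval writes one per commit (20).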


Eno


> On 13 Feb 2017, at 23:46, Guozhang Wang <wangg...@gmail.com> wrote:
> 
> I think I'm OK with always enabling checkpointing, but I'm not sure we want
> to checkpoint on every commit, since in the extreme case users can commit
> after processing each record. So I think it is still valuable to have a
> checkpoint interval config in this KIP, which can be ignored if EOS is
> turned on. That being said, if most people favor checkpointing on each
> commit we can try that as well, since it won't change any public APIs and
> we can still add this config in the future if we do observe users
> reporting a large perf impact.
> 
> 
> 
> Guozhang
> 
> On Fri, Feb 10, 2017 at 12:20 PM, Damian Guy <damian....@gmail.com> wrote:
> 
>> I'm fine with that. Guozhang?
>> On Fri, 10 Feb 2017 at 19:45, Matthias J. Sax <matth...@confluent.io>
>> wrote:
>> 
>>> I am actually supporting Eno's view: checkpoint on every commit.
>>> 
>>> @Dhwani: I understand your view and did raise the same question about
>>> the performance trade-off with checkpointing enabled/disabled etc.
>>> However, it seems that writing the checkpoint file is super cheap -- thus,
>>> there is nothing to gain performance-wise by disabling it.
>>> 
>>> For Streams EoS we do not need the checkpoint file -- but we should have
>>> a switch for EoS anyway and can disable the checkpoint file for this
>>> case. And even if there is no switch and we enable EoS all the time, we
>>> can get rid of the checkpoint file overall (making the parameter
>>> obsolete).
>>> 
>>> IMHO, if the config parameter is not really useful, we should not have
>>> it.
>>> 
>>> 
>>> -Matthias
>>> 
>>> 
>>> On 2/10/17 9:27 AM, Damian Guy wrote:
>>>> Guozhang, thanks for the clarification. Understood.
>>>> 
>>>> Eno, you are correct: if we just used the commit interval then we
>>>> wouldn't need a KIP. But then we'd have no way of turning it off.
>>>> 
>>>> On Fri, 10 Feb 2017 at 17:14 Eno Thereska <eno.there...@gmail.com>
>>> wrote:
>>>> 
>>>>> A quick check: the checkpoint file is not new, we're just exposing a
>>>>> knob for when to write it, right? Would turning it off still do what it
>>>>> does today (i.e., write the checkpoint at the end when the user quits)?
>>>>> So it's not a new feature as such; I was only recommending we dial up
>>>>> the frequency by default. With that option arguably we don't even need
>>>>> a KIP.
>>>>> 
>>>>> Eno
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 10 Feb 2017, at 17:02, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>> 
>>>>>> Damian,
>>>>>> 
>>>>>> I was wondering whether this is a new failure scenario, but as Eno
>>>>>> pointed out, it is not.
>>>>>> 
>>>>>> Another thing I was considering is whether it has any impact on
>>>>>> incorporating KIP-98 to avoid duplicates: if there is a failure in the
>>>>>> middle of a transaction, then upon recovery we cannot rely on the local
>>>>>> state store file even if the checkpoint file exists, since the local
>>>>>> state store file may not be at a transaction boundary. But since
>>>>>> Streams will likely have EOS as an opt-in, I think it is still
>>>>>> worthwhile to add this feature, just keeping in mind that when EOS is
>>>>>> turned on it may cease to be effective.
>>>>>> 
>>>>>> And yes, I'd suggest we allow the config value to be non-positive to
>>>>>> indicate that this feature is turned off, for the reason above: if it
>>>>>> will not be effective then we want to leave the option to turn it off.
>>>>>> 
>>>>>> Guozhang
>>>>>> 
>>>>>> 
>>>>>> On Fri, Feb 10, 2017 at 8:06 AM, Eno Thereska <
>> eno.there...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> The overhead of writing to the checkpoint file should be much, much
>>>>>>> smaller than the overall overhead of doing a commit, so I think tuning
>>>>>>> the commit time is sufficient to guide performance tradeoffs.
>>>>>>> 
>>>>>>> Eno
>>>>>>> 
>>>>>>>> On 10 Feb 2017, at 13:08, Dhwani Katagade <dhwani_katag...@persistent.co.in> wrote:
>>>>>>>> 
>>>>>>>> Maybe for fine-tuning the performance.
>>>>>>>> Say we don't need the checkpointing and would like to gain the little
>>>>>>>> bit of performance improvement by turning it off.
>>>>>>>> The trade-off is between giving people control knobs vs. complicating
>>>>>>>> the complete set of knobs.
>>>>>>>> 
>>>>>>>> -dk
>>>>>>>> 
>>>>>>>> On Friday 10 February 2017 04:05 PM, Eno Thereska wrote:
>>>>>>>>> I can't see why users would care to turn it off.
>>>>>>>>> 
>>>>>>>>> Eno
>>>>>>>>>> On 10 Feb 2017, at 10:29, Damian Guy <damian....@gmail.com>
>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Eno,
>>>>>>>>>> 
>>>>>>>>>> Sounds good to me. The only reason I can think of to keep the config
>>>>>>>>>> is if we want to be able to turn it off.
>>>>>>>>>> Guozhang - thoughts?
>>>>>>>>>> 
>>>>>>>>>> On Fri, 10 Feb 2017 at 10:28 Eno Thereska <
>> eno.there...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Question: if checkpointing is so cheap why not do it every commit
>>>>>>>>>>> interval? That way we can get rid of this extra config variable and
>>>>>>>>>>> just use the existing commit interval.
>>>>>>>>>>> 
>>>>>>>>>>> Fewer tuning knobs.
>>>>>>>>>>> 
>>>>>>>>>>> Eno
>>>>>>>>>>> 
>>>>>>>>>>>> On 10 Feb 2017, at 09:27, Damian Guy <damian....@gmail.com>
>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Guozhang,
>>>>>>>>>>>> 
>>>>>>>>>>>> You've confused me. The failure scenarios you have described are the
>>>>>>>>>>>> same as they are today. With the checkpoint files in place less data
>>>>>>>>>>>> will be replayed, so there will be fewer duplicates.
>>>>>>>>>>>> 
>>>>>>>>>>>> Are you saying you'd like the option to turn checkpointing off?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Damian
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, 9 Feb 2017 at 21:55 Guozhang Wang <wangg...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Eno,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You are right, it is not a new scenario.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thinking a bit more on how we could incorporate KIP-98 in Streams, I
>>>>>>>>>>>>> feel that if EOS is turned on inside Streams, then we probably cannot
>>>>>>>>>>>>> always resume from the checkpointed offsets as they are not guaranteed
>>>>>>>>>>>>> to be "consistent"; but since EOS may not be turned on by default, it
>>>>>>>>>>>>> is still worthwhile to add this feature, I guess.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> About the default config values: I think the default value of 5 min is
>>>>>>>>>>>>> OK to me, since restoration is usually faster than normal processing
>>>>>>>>>>>>> (unless your traffic was really high). About allowing it to be "turned
>>>>>>>>>>>>> off" with a non-positive value: I feel there is still value in keeping
>>>>>>>>>>>>> this door open, as in the future if EOS is turned on, people may just
>>>>>>>>>>>>> want to turn off checkpointing anyway, or there may be other scenarios
>>>>>>>>>>>>> that we have not realized yet. On the other hand, I would argue that
>>>>>>>>>>>>> users are unlikely to set it to a non-positive value by mistake.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 1:03 PM, Eno Thereska <
>>>>>>> eno.there...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Guozhang,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It seems to me we have the same semantics today. Are you saying there
>>>>>>>>>>>>>> is a new failure scenario?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Eno
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 9 Feb 2017, at 19:42, Guozhang Wang <wangg...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> More specifically, here is my reasoning about the failure cases; I
>>>>>>>>>>>>>>> would like to get your feedback:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *StreamTask*
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> For a stream task, the committing order is 1) flush state (may send
>>>>>>>>>>>>>>> more records to the changelog via the producer), 2) flush producer,
>>>>>>>>>>>>>>> 3) commit upstream offsets. My understanding is that the writing of
>>>>>>>>>>>>>>> the checkpoint file will be between 2) and 3), so that the new order
>>>>>>>>>>>>>>> will be 1) flush state, 2) flush producer, 3) write checkpoint file
>>>>>>>>>>>>>>> (when necessary), 4) commit upstream offsets.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> And we have a bunch of "changelog offsets" regarding the state: a)
>>>>>>>>>>>>>>> the offset corresponding to the image of the persistent file, name it
>>>>>>>>>>>>>>> offset A, b) the log end offset, name it offset B, c) the offset
>>>>>>>>>>>>>>> recorded in the checkpoint file, name it offset C, d) the offset
>>>>>>>>>>>>>>> corresponding to the current committed upstream offset, name it
>>>>>>>>>>>>>>> offset D.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Now let's talk about the failure cases:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If there is a crash between 1) and 2), then A > B = C = D. In this
>>>>>>>>>>>>>>> case, if we restore, we will replay no logs at all since B = C while
>>>>>>>>>>>>>>> the persistent state file is actually "ahead of time", and we will
>>>>>>>>>>>>>>> start reprocessing from the input offset corresponding to D = B < A
>>>>>>>>>>>>>>> and hence have some duplicates, *which will be incorrect* if the
>>>>>>>>>>>>>>> update logic involves reading the state store values as well (i.e.
>>>>>>>>>>>>>>> not a blind write), e.g. aggregations.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If there is a crash between 2) and 3), then A = B > C = D. When we
>>>>>>>>>>>>>>> restore, we will replay from C -> B = A, and then start reprocessing
>>>>>>>>>>>>>>> from the input offset corresponding to D < A, and the same issue
>>>>>>>>>>>>>>> applies as above.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If there is a crash between 3) and 4), then A = B = C > D. When we
>>>>>>>>>>>>>>> restore, we will not replay, and then start reprocessing from the
>>>>>>>>>>>>>>> input offset corresponding to D < A, and the same issue applies as
>>>>>>>>>>>>>>> above.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *StandbyTask*
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We only do one operation today, which is 1) flush state. I think we
>>>>>>>>>>>>>>> will add the writing of the checkpoint file after it as step 2).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Failure cases again: offset A -> corresponds to the image of the
>>>>>>>>>>>>>>> file, offset B -> changelog end offset, offset C -> as written in the
>>>>>>>>>>>>>>> checkpoint file.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If there is a crash between 1) and 2), then B >= A > C (B >= A
>>>>>>>>>>>>>>> because we are reading from the changelog topic, so A will never be
>>>>>>>>>>>>>>> greater than B),
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1) and if this task resumes as a standby task, we will resume
>>>>>>>>>>>>>>> restoration from offset C, and a few duplicates from C -> A will be
>>>>>>>>>>>>>>> applied again to the local state files, then continue from A -> B;
>>>>>>>>>>>>>>> *this is OK* since they do not incur any computations, hence have no
>>>>>>>>>>>>>>> side effects, and are all idempotent.
>>>>>>>>>>>>>>> 2) and if this task resumes as a stream task, we will replay
>>>>>>>>>>>>>>> changelogs from C -> A, with duplicated updates, and then from A ->
>>>>>>>>>>>>>>> B. This is also OK for the same reason as above.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> So it seems to me that this is not safe for a StreamTask, or maybe
>>>>>>>>>>>>>>> the writing of the checkpoint file you have in mind is different?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 11:02 AM, Guozhang Wang <
>>>>>>> wangg...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> A quick question re: `We will add the above config parameter to
>>>>>>>>>>>>>>>> *StreamsConfig*. During *StreamTask#commit()*, *StandbyTask#commit()*,
>>>>>>>>>>>>>>>> and *GlobalUpdateStateTask#flushState()* we will check if the
>>>>>>>>>>>>>>>> checkpoint interval has elapsed and write the checkpoint file.`
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Will the writing of the checkpoint file happen before the flushing of
>>>>>>>>>>>>>>>> the state manager?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 10:48 AM, Matthias J. Sax <
>>>>>>>>>>>>> matth...@confluent.io
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> But 5 min means that we (in the worst case) need to replay data from
>>>>>>>>>>>>>>>>> the last 5 minutes to get the store ready.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> So why not go with the min possible value of 30 seconds to speed up
>>>>>>>>>>>>>>>>> this process if the impact is negligible anyway?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> What do you gain by being conservative?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 2/9/17 2:54 AM, Damian Guy wrote:
>>>>>>>>>>>>>>>>>> Why shouldn't it be 5 minutes? ;-)
>>>>>>>>>>>>>>>>>> It is a finger-in-the-air number. The testing I did shows that there
>>>>>>>>>>>>>>>>>> isn't much, if any, overhead when checkpointing a single store on
>>>>>>>>>>>>>>>>>> the commit interval. The default commit interval is 30 seconds, so
>>>>>>>>>>>>>>>>>> it could possibly be set to that. However, I'd prefer to be a little
>>>>>>>>>>>>>>>>>> conservative, so 5 minutes seemed reasonable.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Thu, 9 Feb 2017 at 10:25 Michael Noll <
>>>>> mich...@confluent.io
>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> Damian,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> could you elaborate briefly why the default value should be 5
>>>>>>>>>>>>>>>>>>> minutes? What are the considerations, assumptions, etc. that go into
>>>>>>>>>>>>>>>>>>> picking this value?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Right now, in the KIP and in this discussion, "5 mins" looks like a
>>>>>>>>>>>>>>>>>>> magic number to me. :-)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -Michael
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 11:03 AM, Damian Guy <
>>>>>>> damian....@gmail.com
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> I've run the SimpleBenchmark with checkpointing on and off to see what
>>>>>>>>>>>>>>>>>>>> the impact is. It appears that there is very little impact, if any.
>>>>>>>>>>>>>>>>>>>> The numbers with checkpointing on actually look better, but that is
>>>>>>>>>>>>>>>>>>>> likely due largely to external influences.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> In any case, I'm going to suggest we go with a default checkpoint
>>>>>>>>>>>>>>>>>>>> interval of 5 minutes. I've updated the KIP with this.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> commit every 10 seconds (no checkpoint)
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>>>>>>>>>>>>> 10000000/34798/287372.83751939767/29.570664980746017
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>>>>>>>>>>>>> 10000000/35942/278226.0308274442/28.62945857214401
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>>>>>>>>>>>>> 10000000/34677/288375.58035585546/29.673847218617528
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>>>>>>>>>>>>> 10000000/34677/288375.58035585546/29.673847218617528
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>>>>>>>>>>>>> 10000000/31192/320595.02436522185/32.98922800718133
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> checkpoint every 10 seconds (same as commit interval)
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>>>>>>>>>>>>> 10000000/36997/270292.185852907/27.81306592426413
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>>>>>>>>>>>>> 10000000/32087/311652.69423754164/32.069062237043035
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>>>>>>>>>>>>> 10000000/32895/303997.5680194558/31.281349749202004
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>>>>>>>>>>>>> 10000000/33476/298721.4720994145/30.738439479029754
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>>>>>>>>>>>>> 10000000/33196/301241.1133871551/30.99771056753826
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Wed, 8 Feb 2017 at 09:02 Damian Guy <
>>>>> damian....@gmail.com
>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> Matthias,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Fair point. I'll update the KIP.
>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Wed, 8 Feb 2017 at 05:49 Matthias J. Sax <
>>>>>>>>>>>>> matth...@confluent.io
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> Damian,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I am not strict about it either. However, if there is no advantage in
>>>>>>>>>>>>>>>>>>>>> disabling it, we might not want to allow it. This would have the
>>>>>>>>>>>>>>>>>>>>> advantage of guarding users against accidentally switching it off.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 2/3/17 2:03 AM, Damian Guy wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> It possibly doesn't make sense to disable it, but then I'm sure
>>>>>>>>>>>>>>>>>>>>>> someone will come up with a reason they don't want it!
>>>>>>>>>>>>>>>>>>>>>> I'm happy to change it such that the checkpoint interval must be > 0.
>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>> Damian
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <
>>>>>>>>>>>>>> matth...@confluent.io>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> Thanks Damian.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> One more question: "Checkpointing is disabled if the checkpoint
>>>>>>>>>>>>>>>>>>>>>>> interval is set to a value <=0."
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Does it make sense to disable checkpointing? What's the tradeoff
>>>>>>>>>>>>>>>>>>>>>>> here?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On 2/2/17 1:51 AM, Damian Guy wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the comments.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 1. TBD - I need to do some performance tests and try to work out a
>>>>>>>>>>>>>>>>>>>>>>>> sensible default.
>>>>>>>>>>>>>>>>>>>>>>>> 2. Yes, you are correct. It could be a multiple of the
>>>>>>>>>>>>>>>>>>>>>>>> commit.interval.ms. But that would also mean that if you change the
>>>>>>>>>>>>>>>>>>>>>>>> commit interval - say you lower it - then you might also need to
>>>>>>>>>>>>>>>>>>>>>>>> change the checkpoint setting (i.e., you still only want to
>>>>>>>>>>>>>>>>>>>>>>>> checkpoint every n minutes).
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <
>>>>>>>>>>>>>>>>> matth...@confluent.io
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP Damian.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I am wondering about two things:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 1. what should be the default value for the new parameter?
>>>>>>>>>>>>>>>>>>>>>>>>> 2. why is the new parameter provided in ms?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> About (2): because
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> "the minimum checkpoint interval will be the value of
>>>>>>>>>>>>>>>>>>>>>>>>> commit.interval.ms. In effect the actual checkpoint interval will be
>>>>>>>>>>>>>>>>>>>>>>>>> a multiple of the commit interval"
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> it might be easier to just use a parameter that is
>>>>>>>>>>>>>>>>>>>>>>>>> "number-of-commit-intervals".
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/1/17 7:29 AM, Damian Guy wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the comments Eno.
>>>>>>>>>>>>>>>>>>>>>>>>>> As for exactly once, I don't believe this matters as we are just
>>>>>>>>>>>>>>>>>>>>>>>>>> restoring the change-log, i.e., the result of the aggregations that
>>>>>>>>>>>>>>>>>>>>>>>>>> previously ran etc. So once initialized the state store will be in
>>>>>>>>>>>>>>>>>>>>>>>>>> the same state as it was before.
>>>>>>>>>>>>>>>>>>>>>>>>>> Having the checkpoint in a Kafka topic is not ideal as the state is
>>>>>>>>>>>>>>>>>>>>>>>>>> per Kafka Streams instance. So each instance would need to start
>>>>>>>>>>>>>>>>>>>>>>>>>> with a unique id that is persistent.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>> Damian
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <
>>>>>>>>>>>>>>>>> eno.there...@gmail.com
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> As a follow up to my previous comment, have you thought about
>>>>>>>>>>>>>>>>>>>>>>>>>>> writing the checkpoint to a topic instead of a local file? That
>>>>>>>>>>>>>>>>>>>>>>>>>>> would have the advantage that all metadata continues to be managed
>>>>>>>>>>>>>>>>>>>>>>>>>>> by Kafka, as well as fitting with EoS. The potential disadvantage
>>>>>>>>>>>>>>>>>>>>>>>>>>> would be higher latency; however, if it is periodic as you mention,
>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm not sure that would be a show stopper.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>>>> Eno
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <
>>>>>>>>>>>>>> eno.there...@gmail.com
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Damian, this is a good idea and will reduce the restore time.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward, with exactly once and support for transactions in
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Kafka, I believe we'll have to add some support for rolling back
>>>>>>>>>>>>>>>>>>>>>>>>>>>> checkpoints, e.g., when a transaction is aborted. We need to be
>>>>>>>>>>>>>>>>>>>>>>>>>>>> aware of that and ideally anticipate those needs a bit in the KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eno
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <
>>>>>>>>>>>>> damian....@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to start the discussion on KIP-116:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-116+-+Add+State+Store+Checkpoint+Interval+Configuration
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Damian
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> -- Guozhang
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> -- Guozhang
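
The StreamTask failure cases in Guozhang's 9 Feb message quoted above can be replayed with a toy model. This is a minimal sketch in Python, not Kafka Streams code: `Offsets`, `commit_with_crash`, and the two helper functions are hypothetical names, and it assumes (as that message does) that the four offsets A, B, C and D can be compared directly on one scale.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Offsets:
    a: int  # A: offset reflected in the local persistent state file
    b: int  # B: changelog log end offset
    c: int  # C: offset recorded in the checkpoint file
    d: int  # D: offset corresponding to the committed upstream offset


def commit_with_crash(start: Offsets, new_records: int, steps_completed: int) -> Offsets:
    """Run the proposed 4-step commit and crash after `steps_completed` steps."""
    o = start
    if steps_completed >= 1:   # 1) flush state
        o = replace(o, a=start.a + new_records)
    if steps_completed >= 2:   # 2) flush producer
        o = replace(o, b=o.a)
    if steps_completed >= 3:   # 3) write checkpoint file
        o = replace(o, c=o.b)
    if steps_completed >= 4:   # 4) commit upstream offsets
        o = replace(o, d=o.c)
    return o


def changelog_records_to_replay(o: Offsets) -> int:
    """Restoration replays the changelog from the checkpoint (C) to the log end (B)."""
    return o.b - o.c


def input_records_reprocessed(o: Offsets) -> int:
    """Input already reflected in local state (A) but not yet committed upstream (D)."""
    return o.a - o.d


start = Offsets(a=100, b=100, c=100, d=100)
for steps in range(1, 5):
    after = commit_with_crash(start, new_records=10, steps_completed=steps)
    print(steps, after, changelog_records_to_replay(after), input_records_reprocessed(after))
```

Under these assumptions, a crash after step 1, 2 or 3 always leaves D < A, so some input is reprocessed against state that already includes it; only a crash between 2) and 3) requires replaying the changelog.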
