Even if users commit on every record, the expensive part will not be the checkpointing proposed in this KIP, but the rest of the commit.
Eno

> On 13 Feb 2017, at 23:46, Guozhang Wang <wangg...@gmail.com> wrote:
>
> I think I'm OK to always enable checkpointing, but I'm not sure if we want to checkpoint on every commit, since in the extreme case users can commit after processing each record. So I think it is still valuable to have a checkpoint interval config in this KIP, which can be ignored if EOS is turned on. That being said, if most people favor checkpointing on each commit we can try that as well, since it won't change any public APIs and we can still add this config in the future if we do observe some users reporting that it has a huge perf impact.
>
> Guozhang
>
> On Fri, Feb 10, 2017 at 12:20 PM, Damian Guy <damian....@gmail.com> wrote:
>
>> I'm fine with that. Guozhang?
>>
>> On Fri, 10 Feb 2017 at 19:45, Matthias J. Sax <matth...@confluent.io> wrote:
>>
>>> I am actually supporting Eno's view: checkpoint on every commit.
>>>
>>> @Dhwani: I understand your view and did raise the same question about the performance trade-off with checkpointing enabled/disabled etc. However, it seems that writing the checkpoint file is super cheap -- thus, there is nothing to gain performance-wise by disabling it.
>>>
>>> For Streams EoS we do not need the checkpoint file -- but we should have a switch for EoS anyway and can disable the checkpoint file for this case. And even if there is no switch and we enable EoS all the time, we can get rid of the checkpoint file overall (making the parameter obsolete).
>>>
>>> IMHO, if the config parameter is not really useful, we should not have it.
>>>
>>> -Matthias
>>>
>>> On 2/10/17 9:27 AM, Damian Guy wrote:
>>>> Guozhang, Thanks for the clarification. Understood.
>>>>
>>>> Eno, you are correct: if we just used the commit interval then we wouldn't need a KIP. But then we'd have no way of turning it off.
>>>>
>>>> On Fri, 10 Feb 2017 at 17:14 Eno Thereska <eno.there...@gmail.com> wrote:
>>>>
>>>>> A quick check: the checkpoint file is not new, we're just exposing a knob on when to set it, right? Would turning it off still do what it does today (i.e., write the checkpoint at the end when the user quits)? So it's not a new feature as such, I was only recommending we dial up the frequency by default. With that option arguably we don't even need a KIP.
>>>>>
>>>>> Eno
>>>>>
>>>>>> On 10 Feb 2017, at 17:02, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>
>>>>>> Damian,
>>>>>>
>>>>>> I was wondering whether it is a new failure scenario, but as Eno pointed out it is not.
>>>>>>
>>>>>> Another thing I was considering is whether it has any impact on incorporating KIP-98 to avoid duplicates: if there is a failure in the middle of a transaction, then upon recovery we cannot rely on the local state store file even if the checkpoint file exists, since the local state store file may not be at the transaction boundaries. But since Streams will likely have EOS as an opt-in, I think it is still worthwhile to add this feature, just keeping in mind that when EOS is turned on it may cease to be effective.
>>>>>>
>>>>>> And yes, I'd suggest we allow the config value to be non-positive to indicate not turning on this feature, for the reason above: if it will not be effective then we want to leave it as an option to be turned off.
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>> On Fri, Feb 10, 2017 at 8:06 AM, Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>
>>>>>>> The overhead of writing to the checkpoint file should be much, much smaller than the overall overhead of doing a commit, so I think tuning the commit time is sufficient to guide performance tradeoffs.
>>>>>>>
>>>>>>> Eno
>>>>>>>
>>>>>>>> On 10 Feb 2017, at 13:08, Dhwani Katagade <dhwani_katag...@persistent.co.in> wrote:
>>>>>>>>
>>>>>>>> Maybe for fine-tuning the performance.
>>>>>>>> Say we don't need the checkpointing and would like to gain the little bit of performance improvement from turning it off.
>>>>>>>> The trade-off is between giving people control knobs vs complicating the complete set of knobs.
>>>>>>>>
>>>>>>>> -dk
>>>>>>>>
>>>>>>>> On Friday 10 February 2017 04:05 PM, Eno Thereska wrote:
>>>>>>>>> I can't see why users would care to turn it off.
>>>>>>>>>
>>>>>>>>> Eno
>>>>>>>>>> On 10 Feb 2017, at 10:29, Damian Guy <damian....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Eno,
>>>>>>>>>>
>>>>>>>>>> Sounds good to me. The only reason i can think of is if we want to be able to turn it off.
>>>>>>>>>> Guozhang - thoughts?
>>>>>>>>>>
>>>>>>>>>> On Fri, 10 Feb 2017 at 10:28 Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Question: if checkpointing is so cheap why not do it every commit interval? That way we can get rid of this extra config variable and just use the existing commit interval.
>>>>>>>>>>>
>>>>>>>>>>> Less tuning knobs.
>>>>>>>>>>>
>>>>>>>>>>> Eno
>>>>>>>>>>>
>>>>>>>>>>>> On 10 Feb 2017, at 09:27, Damian Guy <damian....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Guozhang,
>>>>>>>>>>>>
>>>>>>>>>>>> You've confused me. The failure scenarios you have described are the same as they are today. With the checkpoint files in place less data will be replayed, so there will be fewer duplicates.
>>>>>>>>>>>>
>>>>>>>>>>>> Are you saying you'd like the option to turn checkpointing off?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Damian
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 9 Feb 2017 at 21:55 Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Eno,
>>>>>>>>>>>>>
>>>>>>>>>>>>> You are right, it is not a new scenario.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thinking a bit more on how we could incorporate KIP-98 in Streams, I feel that if EOS is turned on inside Streams, then we probably cannot always resume from the checkpointed offsets as it is not guaranteed to be "consistent"; but since EOS may not be turned on by default it is still worthwhile to add this feature, I guess.
>>>>>>>>>>>>>
>>>>>>>>>>>>> About the default config values: I think the default value of 5 min is OK to me, since restoration is usually faster than normal processing (unless your traffic was really high). About allowing it to be "turned off" with a non-positive value: I feel there is still value in keeping this door open, as in the future if EOS is turned on, people may just want to turn off checkpointing anyway, or there may be other scenarios that we have not realized yet. On the other hand, I would argue that it is less likely that users mistakenly set it to a non-positive value.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 1:03 PM, Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Guozhang,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems to me we have the same semantics today. Are you saying there is a new failure scenario?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Eno
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 9 Feb 2017, at 19:42, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> More specifically, here is my reasoning about the failure cases, and I would like to get your feedback:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *StreamTask*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For stream-task, the committing order is 1) flush state (may send more records to changelog in producer), 2) flush producer, 3) commit upstream offsets. My understanding is that the writing of the checkpoint file will be between 2) and 3), so that the new order will be 1) flush state, 2) flush producer, 3) write checkpoint file (when necessary), 4) commit upstream offsets.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And we have a bunch of "changelog offsets" regarding the state: a) offset corresponding to the image of the persistent file, name it offset A, b) log end offset, name it offset B, c) checkpoint file recorded offset, name it offset C, d) offset corresponding to the current committed upstream offset, name it offset D.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Now let's talk about the failure cases:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If there is a crash between 1) and 2), then A > B = C = D. In this case, if we restore, we will replay no logs at all since B = C while the persistent state file is actually "ahead of time", and we will start reprocessing from the input offset corresponding to D = B < A and hence have some duplicates, *which will be incorrect* if the update logic involves reading the state store values as well (i.e. not a blind write), e.g. aggregations.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If there is a crash between 2) and 3), then A = B > C = D. When we restore, we will replay from C -> B = A, and then start reprocessing from the input offset corresponding to D < A, and the same issue applies as above.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If there is a crash between 3) and 4), then A = B = C > D. When we restore, we will not replay, and then start reprocessing from the input offset corresponding to D < A, and the same issue applies as above.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *StandbyTask*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We only do one operation today, which is 1) flush state. I think we will add the writing of the checkpoint file after it as step 2).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Failure cases again: offset A -> corresponds to the image of the file, offset B -> changelog end offset, offset C -> as written in the checkpoint file.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If there is a crash between 1) and 2), then B >= A > C (B >= A because we are reading from the changelog topic so A will never be greater than B),
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) and if this task resumes as a standby task, we will resume restoration from offset C, and a few duplicates from C -> A will be applied again to local state files, then continue from A -> B, *this is OK* since they do not incur any computations hence no side effects and are all idempotent.
>>>>>>>>>>>>>>> 2) and if this task resumes as a stream task, we will replay changelogs from C -> A, with duplicated updates, and then from A -> B. This is also OK for the same reason as above.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So it seems to me that this is not safe for a StreamTask, or maybe the writing of the checkpoint file in your mind is different?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 11:02 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>>> A quick question re: `We will add the above config parameter to *StreamsConfig*. During *StreamTask#commit()*, *StandbyTask#commit()*, and *GlobalUpdateStateTask#flushState()* we will check if the checkpoint interval has elapsed and write the checkpoint file.`
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Will the writing of the checkpoint file happen before the flushing of the state manager?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 10:48 AM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> But 5 min means that we (in the worst case) need to replay data from the last 5 minutes to get the store ready.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So why not go with the min possible value of 30 seconds to speed up this process if the impact is negligible anyway?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What do you gain by being conservative?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 2/9/17 2:54 AM, Damian Guy wrote:
>>>>>>>>>>>>>>>>>> Why shouldn't it be 5 minutes? ;-)
>>>>>>>>>>>>>>>>>> It is a finger in the air number. The testing i did shows that there isn't much, if any, overhead when checkpointing a single store on the commit interval. The default commit interval is 30 seconds, so it could possibly be set to that. However, i'd prefer to be a little conservative so 5 minutes seemed reasonable.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, 9 Feb 2017 at 10:25 Michael Noll <mich...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>> Damian,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> could you elaborate briefly why the default value should be 5 minutes? What are the considerations, assumptions, etc. that go into picking this value?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Right now, in the KIP and in this discussion, "5 mins" looks like a magic number to me. :-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -Michael
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 11:03 AM, Damian Guy <damian....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>> I've run the SimpleBenchmark with checkpoint on and off to see what the impact is. It appears that there is very little impact, if any. The numbers with checkpointing on actually look better, but that is likely largely due to external influences.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In any case, i'm going to suggest we go with a default checkpoint interval of 5 minutes. I've updated the KIP with this.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> commit every 10 seconds (no checkpoint)
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/34798/287372.83751939767/29.570664980746017
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/35942/278226.0308274442/28.62945857214401
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/34677/288375.58035585546/29.673847218617528
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/34677/288375.58035585546/29.673847218617528
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/31192/320595.02436522185/32.98922800718133
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> checkpoint every 10 seconds (same as commit interval)
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/36997/270292.185852907/27.81306592426413
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/32087/311652.69423754164/32.069062237043035
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/32895/303997.5680194558/31.281349749202004
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/33476/298721.4720994145/30.738439479029754
>>>>>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/33196/301241.1133871551/30.99771056753826
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, 8 Feb 2017 at 09:02 Damian Guy <damian....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> Matthias,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Fair point. I'll update the KIP.
>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, 8 Feb 2017 at 05:49 Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>>>> Damian,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I am not strict about it either. However, if there is no advantage in disabling it, we might not want to allow it. This would have the advantage of guarding users against accidentally switching it off.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 2/3/17 2:03 AM, Damian Guy wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> It possibly doesn't make sense to disable it, but then i'm sure someone will come up with a reason they don't want it!
>>>>>>>>>>>>>>>>>>>>>> I'm happy to change it such that the checkpoint interval must be > 0.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>> Damian
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>>>>>> Thanks Damian.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> One more question: "Checkpointing is disabled if the checkpoint interval is set to a value <=0."
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Does it make sense to disable checkpointing? What's the tradeoff here?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 2/2/17 1:51 AM, Damian Guy wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the comments.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 1. TBD - i need to do some performance tests and try and work out a sensible default.
>>>>>>>>>>>>>>>>>>>>>>>> 2. Yes, you are correct. It could be a multiple of the commit.interval.ms. But, that would also mean that if you change the commit interval - say you lower it - then you might also need to change the checkpoint setting (i.e., you still only want to checkpoint every n minutes).
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP Damian.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I am wondering about two things:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> 1. what should be the default value for the new parameter?
>>>>>>>>>>>>>>>>>>>>>>>>> 2. why is the new parameter provided in ms?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> About (2): because
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> "the minimum checkpoint interval will be the value of commit.interval.ms. In effect the actual checkpoint interval will be a multiple of the commit interval"
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> it might be easier to just use a parameter that is "number-of-commit-intervals".
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/1/17 7:29 AM, Damian Guy wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the comments Eno.
>>>>>>>>>>>>>>>>>>>>>>>>>> As for exactly once, i don't believe this matters as we are just restoring the change-log, i.e., the result of the aggregations that previously ran etc. So once initialized the state store will be in the same state as it was before.
>>>>>>>>>>>>>>>>>>>>>>>>>> Having the checkpoint in a kafka topic is not ideal as the state is per kafka streams instance. So each instance would need to start with a unique id that is persistent.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>> Damian
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> As a follow up to my previous comment, have you thought about writing the checkpoint to a topic instead of a local file? That would have the advantage that all metadata continues to be managed by Kafka, as well as fit with EoS. The potential disadvantage would be a slower latency, however if it is periodic as you mention, I'm not sure that would be a show stopper.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>>>> Eno
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Damian, this is a good idea and will reduce the restore time. Looking forward, with exactly once and support for transactions in Kafka, I believe we'll have to add some support for rolling back checkpoints, e.g., when a transaction is aborted. We need to be aware of that and ideally anticipate a bit those needs in the KIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eno
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <damian....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to start the discussion on KIP-116:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-116+-+Add+State+Store+Checkpoint+Interval+Configuration
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Damian
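
For anyone following the thread, the StreamTask failure cases in Guozhang's 9 Feb message can be replayed with a small toy model. This is only a sketch under stated assumptions: it is not Kafka Streams code, every name in it (`Offsets`, `crash_after`, `changelog_replay_range`) is hypothetical, and it assumes the proposed four-step commit order (flush state, flush producer, write checkpoint file, commit upstream offsets) with each step advancing exactly one of the offsets A, B, C, D from that message.

```python
from dataclasses import dataclass

# Toy model only; all names are hypothetical and none of this is Kafka Streams API.
# Each commit step advances one offset, in the order proposed in the thread:
#   1) flush state      -> A (image of the persistent state file)
#   2) flush producer   -> B (changelog log end offset)
#   3) write checkpoint -> C (offset recorded in the checkpoint file)
#   4) commit offsets   -> D (changelog offset matching the committed upstream offset)
FIELDS = ("a_store_image", "b_changelog_end", "c_checkpoint", "d_committed")


@dataclass
class Offsets:
    a_store_image: int
    b_changelog_end: int
    c_checkpoint: int
    d_committed: int


def crash_after(completed_steps: int, start: int, target: int) -> Offsets:
    """Return the offsets left behind when a crash follows `completed_steps` steps."""
    if not 0 <= completed_steps <= len(FIELDS):
        raise ValueError("completed_steps must be between 0 and 4")
    offsets = Offsets(start, start, start, start)
    for name in FIELDS[:completed_steps]:
        setattr(offsets, name, target)
    return offsets


def changelog_replay_range(offsets: Offsets) -> tuple:
    """On restore, the changelog is replayed from checkpoint C up to log end B."""
    return (offsets.c_checkpoint, offsets.b_changelog_end)


if __name__ == "__main__":
    for steps in range(1, 4):
        state = crash_after(steps, start=100, target=150)
        print(steps, state, changelog_replay_range(state))
```

In this model, a crash after step 1 leaves A ahead of B, C and D with an empty replay range, a crash after step 2 replays C -> B, and a crash after step 3 replays nothing while D still lags, which matches the three cases listed in the message. The model says nothing about how reprocessing from D interacts with non-idempotent updates; that part of the argument stays as stated in the thread.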