Guozhang,

Thanks for the clarification. Understood.

Eno, you are correct: if we just used the commit interval, then we wouldn't need a KIP. But then we'd have no way of turning it off.
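A minimal sketch of the trade-off discussed here, assuming a hypothetical helper class `CheckpointGate` (not part of Kafka Streams) that is consulted on every commit. A positive interval throttles checkpoint writes to at most one per interval, and a non-positive interval turns them off, which is the switch that would be lost if checkpoints were tied directly to the commit interval.

```java
/**
 * Hypothetical helper, not a Kafka Streams class: decides on each commit
 * whether the checkpoint file should also be written.
 */
final class CheckpointGate {
    // Assumption taken from the discussion: a value <= 0 disables checkpointing.
    private final long checkpointIntervalMs;
    private long lastCheckpointMs;

    CheckpointGate(long checkpointIntervalMs, long startMs) {
        this.checkpointIntervalMs = checkpointIntervalMs;
        this.lastCheckpointMs = startMs;
    }

    boolean enabled() {
        return checkpointIntervalMs > 0;
    }

    /** Called once per commit; returns true if a checkpoint should be written now. */
    boolean shouldCheckpoint(long nowMs) {
        if (!enabled()) {
            return false; // feature switched off
        }
        if (nowMs - lastCheckpointMs < checkpointIntervalMs) {
            return false; // interval has not elapsed yet
        }
        lastCheckpointMs = nowMs;
        return true;
    }
}
```

Because the gate is only consulted on commit, the effective checkpoint period in this sketch is the first commit at or after the configured interval, so it can never be shorter than the commit interval.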
On Fri, 10 Feb 2017 at 17:14 Eno Thereska <eno.there...@gmail.com> wrote:

> A quick check: the checkpoint file is not new, we're just exposing a knob on when to set it, right? Would turning it off still do what it does today (i.e., write the checkpoint at the end when the user quits?) So it's not a new feature as such, I was only recommending we dial up the frequency by default. With that option arguably we don't even need a KIP.
>
> Eno
>
> On 10 Feb 2017, at 17:02, Guozhang Wang <wangg...@gmail.com> wrote:
>
>> Damian,
>>
>> I was thinking if it is a new failure scenario but as Eno pointed out it was not.
>>
>> Another thing I was considering is if it has any impact for incorporating KIP-98 to avoid duplicates: if there is a failure in the middle of a transaction, then upon recovery we cannot rely on the local state store file even if the checkpoint file exists, since the local state store file may not be at the transaction boundaries. But since Streams will likely have EOS as an opt-in I think it is still worthwhile to add this feature, just keeping in mind that when EOS is turned on it may cease to be effective.
>>
>> And yes, I'd suggest we leave the config value to be possibly non-positive to indicate not turning on this feature for the reason above: if it will not be effective then we want to leave it as an option to be turned off.
>>
>> Guozhang
>>
>> On Fri, Feb 10, 2017 at 8:06 AM, Eno Thereska <eno.there...@gmail.com> wrote:
>>
>>> The overhead of writing to the checkpoint file should be much, much smaller than the overall overhead of doing a commit, so I think tuning the commit time is sufficient to guide performance tradeoffs.
>>>
>>> Eno
>>>
>>> On 10 Feb 2017, at 13:08, Dhwani Katagade <dhwani_katag...@persistent.co.in> wrote:
>>>
>>>> Maybe for fine-tuning the performance.
>>>> Say we don't need the checkpointing and would like to gain the little bit of performance improvement by turning it off.
>>>> The trade-off is between giving people control knobs vs complicating the complete set of knobs.
>>>>
>>>> -dk
>>>>
>>>> On Friday 10 February 2017 04:05 PM, Eno Thereska wrote:
>>>>> I can't see why users would care to turn it off.
>>>>>
>>>>> Eno
>>>>>
>>>>> On 10 Feb 2017, at 10:29, Damian Guy <damian....@gmail.com> wrote:
>>>>>
>>>>>> Hi Eno,
>>>>>>
>>>>>> Sounds good to me. The only reason i can think of is if we want to be able to turn it off.
>>>>>> Guozhang - thoughts?
>>>>>>
>>>>>> On Fri, 10 Feb 2017 at 10:28 Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>
>>>>>>> Question: if checkpointing is so cheap why not do it every commit interval? That way we can get rid of this extra config variable and just use the existing commit interval.
>>>>>>>
>>>>>>> Less tuning knobs.
>>>>>>>
>>>>>>> Eno
>>>>>>>
>>>>>>> On 10 Feb 2017, at 09:27, Damian Guy <damian....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Guozhang,
>>>>>>>>
>>>>>>>> You've confused me. The failure scenarios you have described are the same as they are today. With the checkpoint files in place less data will be replayed, so there will be fewer duplicates.
>>>>>>>>
>>>>>>>> Are you saying you'd like the option to turn checkpointing off?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Damian
>>>>>>>>
>>>>>>>> On Thu, 9 Feb 2017 at 21:55 Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Eno,
>>>>>>>>>
>>>>>>>>> You are right, it is not a new scenario.
>>>>>>>>>
>>>>>>>>> Thinking a bit more on how we could incorporate KIP-98 in Streams, I feel that if EOS is turned on inside Streams, then we probably cannot always resume from the checkpointed offsets as it is not guaranteed to be "consistent"; but since EOS may not be turned on by default it is still worthwhile to add this feature I guess.
>>>>>>>>>
>>>>>>>>> About the default config values: I think the default value of 5 min is OK to me, since restoration is usually faster than normal processing (unless your traffic was really high). About allowing it to be "turned off" with a non-positive value: I feel there is still value in keeping this door open, as in the future if EOS is turned on, people may just want to turn off checkpointing anyways, or there may be other scenarios that we have not realized yet. On the other hand, I would argue that it is less likely users mistakenly set it to a non-positive value.
>>>>>>>>>
>>>>>>>>> Guozhang
>>>>>>>>>
>>>>>>>>> On Thu, Feb 9, 2017 at 1:03 PM, Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guozhang,
>>>>>>>>>>
>>>>>>>>>> It seems to me we have the same semantics today. Are you saying there is a new failure scenario?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Eno
>>>>>>>>>>
>>>>>>>>>> On 9 Feb 2017, at 19:42, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> More specifically, here is my reasoning of failure cases, and would like to get your feedback:
>>>>>>>>>>>
>>>>>>>>>>> *StreamTask*
>>>>>>>>>>>
>>>>>>>>>>> For stream-task, the committing order is 1) flush state (may send more records to changelog in producer), 2) flush producer, 3) commit upstream offsets. My understanding is that the writing of the checkpoint file will be between 2) and 3). So that the new order will be 1) flush state, 2) flush producer, 3) write checkpoint file (when necessary), 4) commit upstream offsets.
>>>>>>>>>>>
>>>>>>>>>>> And we have a bunch of "changelog offsets" regarding the state: a) offset corresponding to the image of the persistent file, name it offset A, b) log end offset, name it offset B, c) checkpoint file recorded offset, name it offset C, d) offset corresponding to the current committed upstream offset, name it offset D.
>>>>>>>>>>>
>>>>>>>>>>> Now let's talk about the failure cases:
>>>>>>>>>>>
>>>>>>>>>>> If there is a crash between 1) and 2), then A > B = C = D. In this case, if we restore, we will replay no logs at all since B = C while the persistent state file is actually "ahead of time", and we will start reprocessing from the input offset corresponding to D = B < A and hence have some duplicates, *which will be incorrect* if the update logic involves reading the state store values as well (i.e. not a blind write), e.g. aggregations.
>>>>>>>>>>>
>>>>>>>>>>> If there is a crash between 2) and 3), then A = B > C = D. When we restore, we will replay from C -> B = A, and then start reprocessing from input offset corresponding to D < A, and same issue applies as above.
>>>>>>>>>>>
>>>>>>>>>>> If there is a crash between 3) and 4), then A = B = C > D. When we restore, we will not replay, and then start reprocessing from input offset corresponding to D < A, and same issue applies as above.
>>>>>>>>>>>
>>>>>>>>>>> *StandbyTask*
>>>>>>>>>>>
>>>>>>>>>>> We only do one operation today, which is 1) flush state, I think we will add the writing of the checkpoint file after it as step 2).
>>>>>>>>>>>
>>>>>>>>>>> Failure cases again: offset A -> corresponds to the image of the file, offset B -> changelog end offset, offset C -> written as in the checkpoint file.
>>>>>>>>>>>
>>>>>>>>>>> If there is a crash between 1) and 2), then B >= A > C (B >= A because we are reading from changelog topic so A will never be greater than B),
>>>>>>>>>>>
>>>>>>>>>>> 1) and if this task resumes as a standby task, we will resume restoration from offset C, and a few duplicates from C -> A will be applied again to local state files, then continue from A -> B, *this is OK* since they do not incur any computations hence no side effects and are all idempotent.
>>>>>>>>>>>
>>>>>>>>>>> 2) and if this task resumes as a stream task, we will replay changelogs from C -> A, with duplicated updates, and then from A -> B. This is also OK for the same reason as above.
>>>>>>>>>>>
>>>>>>>>>>> So it seems to me that this is not safe for a StreamTask, or maybe the writing of the checkpoint file in your mind is different?
>>>>>>>>>>>
>>>>>>>>>>> Guozhang
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Feb 9, 2017 at 11:02 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> A quick question re: `We will add the above config parameter to *StreamsConfig*. During *StreamTask#commit()*, *StandbyTask#commit()*, and *GlobalUpdateStateTask#flushState()* we will check if the checkpoint interval has elapsed and write the checkpoint file.`
>>>>>>>>>>>>
>>>>>>>>>>>> Will the writing of the checkpoint file happen before the flushing of the state manager?
>>>>>>>>>>>>
>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 9, 2017 at 10:48 AM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> But 5 min means that we (in the worst case) need to replay data from the last 5 minutes to get the store ready.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So why not go with the min possible value of 30 seconds to speed up this process if the impact is negligible anyway?
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you gain by being conservative?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/9/17 2:54 AM, Damian Guy wrote:
>>>>>>>>>>>>>> Why shouldn't it be 5 minutes? ;-)
>>>>>>>>>>>>>> It is a finger in the air number. Based on the testing i did it shows that there isn't much, if any, overhead when checkpointing a single store on the commit interval. The default commit interval is 30 seconds, so it could possibly be set to that. However, i'd prefer to be a little conservative so 5 minutes seemed reasonable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 9 Feb 2017 at 10:25 Michael Noll <mich...@confluent.io> wrote:
>>>>>>>>>>>>>>> Damian,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> could you elaborate briefly why the default value should be 5 minutes? What are the considerations, assumptions, etc. that go into picking this value?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Right now, in the KIP and in this discussion, "5 mins" looks like a magic number to me. :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Michael
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 11:03 AM, Damian Guy <damian....@gmail.com> wrote:
>>>>>>>>>>>>>>>> I've run the SimpleBenchmark with checkpoint on and off to see what the impact is. It appears that there is very little impact, if any. The numbers with checkpointing on actually look better, but that is likely largely due to external influences.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In any case, i'm going to suggest we go with a default checkpoint interval of 5 minutes. I've updated the KIP with this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> commit every 10 seconds (no checkpoint)
>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/34798/287372.83751939767/29.570664980746017
>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/35942/278226.0308274442/28.62945857214401
>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/34677/288375.58035585546/29.673847218617528
>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/34677/288375.58035585546/29.673847218617528
>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/31192/320595.02436522185/32.98922800718133
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> checkpoint every 10 seconds (same as commit interval)
>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/36997/270292.185852907/27.81306592426413
>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/32087/311652.69423754164/32.069062237043035
>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/32895/303997.5680194558/31.281349749202004
>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/33476/298721.4720994145/30.738439479029754
>>>>>>>>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]: 10000000/33196/301241.1133871551/30.99771056753826
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, 8 Feb 2017 at 09:02 Damian Guy <damian....@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Matthias,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Fair point. I'll update the KIP.
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, 8 Feb 2017 at 05:49 Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>>>>>> Damian,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am not strict about it either. However, if there is no advantage in disabling it, we might not want to allow it. This would have the advantage of guarding users against accidentally switching it off.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 2/3/17 2:03 AM, Damian Guy wrote:
>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It possibly doesn't make sense to disable it, but then i'm sure someone will come up with a reason they don't want it!
>>>>>>>>>>>>>>>>>> I'm happy to change it such that the checkpoint interval must be > 0.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Damian
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>> Thanks Damian.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> One more question: "Checkpointing is disabled if the checkpoint interval is set to a value <=0."
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Does it make sense to disable checkpointing? What's the tradeoff here?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 2/2/17 1:51 AM, Damian Guy wrote:
>>>>>>>>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for the comments.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1. TBD - i need to do some performance tests and try and work out a sensible default.
>>>>>>>>>>>>>>>>>>>> 2. Yes, you are correct. It could be a multiple of the commit.interval.ms. But, that would also mean if you change the commit interval - say you lower it, then you might also need to change the checkpoint setting (i.e, you still only want to checkpoint every n minutes).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP Damian.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I am wondering about two things:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1. what should be the default value for the new parameter?
>>>>>>>>>>>>>>>>>>>>> 2. why is the new parameter provided in ms?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> About (2): because
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> "the minimum checkpoint interval will be the value of commit.interval.ms. In effect the actual checkpoint interval will be a multiple of the commit interval"
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> it might be easier to just use a parameter that is "number-of-commit intervals".
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 2/1/17 7:29 AM, Damian Guy wrote:
>>>>>>>>>>>>>>>>>>>>>> Thanks for the comments Eno.
>>>>>>>>>>>>>>>>>>>>>> As for exactly once, i don't believe this matters as we are just restoring the change-log, i.e, the result of the aggregations that previously ran etc. So once initialized the state store will be in the same state as it was before.
>>>>>>>>>>>>>>>>>>>>>> Having the checkpoint in a kafka topic is not ideal as the state is per kafka streams instance. So each instance would need to start with a unique id that is persistent.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>> Damian
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>> As a follow up to my previous comment, have you thought about writing the checkpoint to a topic instead of a local file? That would have the advantage that all metadata continues to be managed by Kafka, as well as fit with EoS. The potential disadvantage would be a slower latency, however if it is periodic as you mention, I'm not sure that would be a show stopper.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>> Eno
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Thanks Damian, this is a good idea and will reduce the restore time. Looking forward, with exactly once and support for transactions in Kafka, I believe we'll have to add some support for rolling back checkpoints, e.g., when a transaction is aborted. We need to be aware of that and ideally anticipate a bit those needs in the KIP.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>> Eno
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <damian....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I would like to start the discussion on KIP-116:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-116+-+Add+State+Store+Checkpoint+Interval+Configuration
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Damian
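
The StreamTask failure cases quoted in this thread can be tabulated with a small model. This is only an illustration of the offsets A, B, C and D as named in the quoted analysis, assuming all four are equal after the last complete commit and that one commit attempt advances them by `delta`. The class and method names are hypothetical and do not exist in Kafka Streams.

```java
/** Illustrative model of the quoted four-step commit order; not Kafka Streams code. */
final class CommitCrashModel {
    // Index i is the offset advanced by commit step i + 1 of the proposed order.
    static final int A = 0; // step 1, flush state: image of the local state file
    static final int B = 1; // step 2, flush producer: changelog end offset
    static final int C = 2; // step 3, write checkpoint file: checkpointed offset
    static final int D = 3; // step 4, commit upstream offsets: committed position

    /**
     * Offsets {A, B, C, D} seen after a crash.
     *
     * @param completedSteps number of commit steps that finished before the crash (0..4)
     * @param base           common value of all offsets after the last full commit (assumption)
     * @param delta          progress made since that commit (assumption)
     */
    static long[] offsetsAfterCrash(int completedSteps, long base, long delta) {
        long[] offsets = {base, base, base, base};
        for (int i = 0; i < Math.min(completedSteps, 4); i++) {
            offsets[i] = base + delta;
        }
        return offsets;
    }

    /** Changelog records replayed on restore: from the checkpoint C up to the log end B. */
    static long replayed(long[] offsets) {
        return offsets[B] - offsets[C];
    }

    /** True when the local state image is ahead of the committed upstream position. */
    static boolean stateAheadOfInput(long[] offsets) {
        return offsets[A] > offsets[D];
    }
}
```

With `completedSteps` set to 1, 2 and 3 the model reproduces A > B = C = D, A = B > C = D and A = B = C > D respectively. Only the second case replays changelog records, and in all three the state image is ahead of the committed input position, which is the situation the thread later agrees already exists without the checkpoint step.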