Re: [DISCUSS] KIP-116 - Add State Store Checkpoint Interval Configuration

Eno Thereska Fri, 10 Feb 2017 02:28:32 -0800

Question: if checkpointing is so cheap why not do it every commit interval? 
That way we can get rid of this extra config variable and just use the existing 
commit interval.


Fewer tuning knobs.
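
To make the trade-off concrete, here is a minimal sketch of a checkpoint check that runs on every commit. It is not Kafka Streams code: `CheckpointScheduler` and `count_checkpoints` are hypothetical names, and treating a non-positive interval as "disabled" is an assumption taken from the KIP sentence quoted further down the thread.

```python
class CheckpointScheduler:
    """Decides, at commit time, whether a checkpoint file should be written."""

    def __init__(self, checkpoint_interval_ms):
        self.checkpoint_interval_ms = checkpoint_interval_ms
        self.last_checkpoint_ms = 0  # assumption: the task starts at time 0

    def should_checkpoint(self, now_ms):
        # Assumption: a non-positive interval disables checkpointing.
        if self.checkpoint_interval_ms <= 0:
            return False
        if now_ms - self.last_checkpoint_ms >= self.checkpoint_interval_ms:
            self.last_checkpoint_ms = now_ms
            return True
        return False


def count_checkpoints(commit_interval_ms, checkpoint_interval_ms, run_ms):
    """Count checkpoints written when the check runs once per commit."""
    scheduler = CheckpointScheduler(checkpoint_interval_ms)
    commit_times = range(commit_interval_ms, run_ms + 1, commit_interval_ms)
    return sum(scheduler.should_checkpoint(t) for t in commit_times)


if __name__ == "__main__":
    ten_minutes = 600_000
    # Checkpoint interval equal to the commit interval: one per commit.
    print(count_checkpoints(30_000, 30_000, ten_minutes))
    # 5 minute checkpoint interval with a 30 second commit interval.
    print(count_checkpoints(30_000, 300_000, ten_minutes))
    # 45 seconds is not a multiple of 30, so it effectively rounds up to 60.
    print(count_checkpoints(30_000, 45_000, ten_minutes))
```

Because the check only runs at commit time, the effective checkpoint interval is always a multiple of the commit interval. Setting the two equal gives the "checkpoint on every commit" behaviour without a separate knob.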

Eno

> On 10 Feb 2017, at 09:27, Damian Guy <[email protected]> wrote:
> 
> Guozhang,
> 
> You've confused me. The failure scenarios you have described are the same
> as they are today. With the checkpoint files in place less data will be
> replayed, so there will be fewer duplicates.
> 
> Are you saying you'd like the option to turn checkpointing off?
> 
> Thanks,
> Damian
> 
> On Thu, 9 Feb 2017 at 21:55 Guozhang Wang <[email protected]> wrote:
> 
>> Eno,
>> 
>> You are right, it is not a new scenario.
>> 
>> Thinking a bit more on how we could incorporate KIP-98 in Streams, I feel
>> that if EOS is turned on inside Streams, then we probably cannot always
>> resume from the checkpointed offsets as it is not guaranteed to be
>> "consistent"; but since EOS may not be turned on by default, it is still
>> worthwhile to add this feature, I guess.
>> 
>> About the default config values: I think the default value of 5 min is OK
>> to me, since restoration is usually faster than normal processing (unless
>> your traffic was really high). About allowing it to be "turned off" with a
>> non-positive value: I feel there is still value in keeping this door open,
>> as in the future, if EOS is turned on, people may just want to turn off
>> checkpointing anyway, or there may be other scenarios that we have not
>> realized yet. On the other hand, I would argue that it is less likely that
>> users mistakenly set it to a non-positive value.
>> 
>> Guozhang
>> 
>> On Thu, Feb 9, 2017 at 1:03 PM, Eno Thereska <[email protected]>
>> wrote:
>> 
>>> Hi Guozhang,
>>> 
>>> It seems to me we have the same semantics today. Are you saying there is a
>>> new failure scenario?
>>> 
>>> Thanks,
>>> Eno
>>> 
>>>> On 9 Feb 2017, at 19:42, Guozhang Wang <[email protected]> wrote:
>>>> 
>>>> More specifically, here is my reasoning about the failure cases, and I
>>>> would like to get your feedback:
>>>> 
>>>> *StreamTask*
>>>> 
>>>> For a stream task, the commit order is 1) flush state (may send more
>>>> records to changelog in producer), 2) flush producer, 3) commit upstream
>>>> offsets. My understanding is that the writing of the checkpoint file will
>>>> happen between 2) and 3), so that the new order will be 1) flush state,
>>>> 2) flush producer, 3) write checkpoint file (when necessary), 4) commit
>>>> upstream offsets.
>>>> 
>>>> And we have a bunch of "changelog offsets" regarding the state: a) offset
>>>> corresponding to the image of the persistent file, name it offset A, b)
>>>> log end offset, name it offset B, c) checkpoint file recorded offset, name
>>>> it offset C, d) offset corresponding to the current committed upstream
>>>> offset, name it offset D.
>>>> 
>>>> Now let's talk about the failure cases:
>>>> 
>>>> If there is a crash between 1) and 2), then A > B = C = D. In this case,
>>>> if we restore, we will replay no logs at all since B = C, while the
>>>> persistent state file is actually "ahead of time", and we will start
>>>> reprocessing from the input offset corresponding to D = B < A and hence
>>>> have some duplicates, *which will be incorrect* if the update logic
>>>> involves reading the state store values as well (i.e. not a blind write),
>>>> e.g. aggregations.
>>>> 
>>>> If there is a crash between 2) and 3), then A = B > C = D. When we restore,
>>>> we will replay from C -> B = A, and then start reprocessing from the input
>>>> offset corresponding to D < A, and the same issue applies as above.
>>>> 
>>>> If there is a crash between 3) and 4), then A = B = C > D. When we restore,
>>>> we will not replay, and then start reprocessing from the input offset
>>>> corresponding to D < A, and the same issue applies as above.
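
The three StreamTask crash cases above can be reproduced with a toy model. This is only a sketch under stated assumptions: `Offsets`, `crash_after`, and `recovery` are hypothetical names, the offsets are plain integers, and a commit is assumed to advance every offset by the same amount. None of this is Kafka Streams code.

```python
from dataclasses import dataclass


@dataclass
class Offsets:
    a: int  # changelog offset reflected in the local persistent state image
    b: int  # changelog log end offset
    c: int  # offset recorded in the checkpoint file
    d: int  # changelog offset matching the committed upstream offset


def crash_after(steps_done, start=100, progress=10):
    """Offsets left behind if the task crashes after `steps_done` commit steps.

    Steps, in the proposed order: 1) flush state, 2) flush producer,
    3) write checkpoint file, 4) commit upstream offsets.
    """
    offsets = Offsets(start, start, start, start)
    target = start + progress
    if steps_done >= 1:
        offsets.a = target
    if steps_done >= 2:
        offsets.b = target
    if steps_done >= 3:
        offsets.c = target
    if steps_done >= 4:
        offsets.d = target
    return offsets


def recovery(offsets):
    """Return (changelog records replayed, updates applied a second time)."""
    replayed = max(0, offsets.b - offsets.c)    # restore runs from C to B
    duplicated = max(0, offsets.a - offsets.d)  # input is reprocessed from D
    return replayed, duplicated


if __name__ == "__main__":
    for steps_done in range(1, 5):
        offsets = crash_after(steps_done)
        print(steps_done, offsets, recovery(offsets))
```

In this model every crash before step 4) leaves D behind A, so the same number of updates is applied twice whether or not a checkpoint file was written. Only the amount of changelog replay differs between the cases.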
>>>> 
>>>> 
>>>> *StandbyTask*
>>>> 
>>>> We only do one operation today, which is 1) flush state. I think we will
>>>> add the writing of the checkpoint file after it as step 2).
>>>> 
>>>> Failure cases again: offset A -> corresponds to the image of the file,
>>>> offset B -> changelog end offset, offset C -> written in the checkpoint
>>>> file.
>>>> 
>>>> If there is a crash between 1) and 2), then B >= A > C (B >= A because we
>>>> are reading from the changelog topic, so A will never be greater than B).
>>>> 
>>>> 1) If this task resumes as a standby task, we will resume restoration
>>>> from offset C, and a few duplicates from C -> A will be applied again to
>>>> the local state files, then continue from A -> B. *This is OK* since they
>>>> do not incur any computations, hence have no side effects, and are all
>>>> idempotent.
>>>> 
>>>> 2) If this task resumes as a stream task, we will replay changelogs
>>>> from C -> A, with duplicated updates, and then from A -> B. This is also
>>>> OK for the same reason as above.
>>>> 
>>>> 
>>>> 
>>>> So it seems to me that this is not safe for a StreamTask, or maybe the
>>>> writing of the checkpoint file in your mind is different?
>>>> 
>>>> 
>>>> Guozhang
>>>> 
>>>> 
>>>> 
>>>> On Thu, Feb 9, 2017 at 11:02 AM, Guozhang Wang <[email protected]>
>>> wrote:
>>>> 
>>>>> A quick question re: `We will add the above config parameter to
>>>>> *StreamsConfig*. During *StreamTask#commit()*, *StandbyTask#commit()*,
>>>>> and *GlobalUpdateStateTask#flushState()* we will check if the checkpoint
>>>>> interval has elapsed and write the checkpoint file.`
>>>>> 
>>>>> Will the writing of the checkpoint file happen before the flushing of the
>>>>> state manager?
>>>>> 
>>>>> Guozhang
>>>>> 
>>>>> 
>>>>> On Thu, Feb 9, 2017 at 10:48 AM, Matthias J. Sax <
>> [email protected]
>>>> 
>>>>> wrote:
>>>>> 
>>>>>> But 5 min means that we (in the worst case) need to replay data from the
>>>>>> last 5 minutes to get the store ready.
>>>>>> 
>>>>>> So why not go with the min possible value of 30 seconds to speed up this
>>>>>> process if the impact is negligible anyway?
>>>>>> 
>>>>>> What do you gain by being conservative?
>>>>>> 
>>>>>> 
>>>>>> -Matthias
>>>>>> 
>>>>>> On 2/9/17 2:54 AM, Damian Guy wrote:
>>>>>>> Why shouldn't it be 5 minutes? ;-)
>>>>>>> It is a finger-in-the-air number. Based on the testing I did, there
>>>>>>> isn't much, if any, overhead when checkpointing a single store on the
>>>>>>> commit interval. The default commit interval is 30 seconds, so it could
>>>>>>> possibly be set to that. However, I'd prefer to be a little
>>>>>>> conservative, so 5 minutes seemed reasonable.
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, 9 Feb 2017 at 10:25 Michael Noll <[email protected]>
>>> wrote:
>>>>>>> 
>>>>>>>> Damian,
>>>>>>>> 
>>>>>>>> could you elaborate briefly why the default value should be 5 minutes?
>>>>>>>> What are the considerations, assumptions, etc. that go into picking
>>>>>>>> this value?
>>>>>>>> 
>>>>>>>> Right now, in the KIP and in this discussion, "5 mins" looks like a
>>>>>>>> magic number to me. :-)
>>>>>>>> 
>>>>>>>> -Michael
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Feb 9, 2017 at 11:03 AM, Damian Guy <[email protected]>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I've run the SimpleBenchmark with checkpointing on and off to see what
>>>>>>>>> the impact is. It appears that there is very little impact, if any. The
>>>>>>>>> numbers with checkpointing on actually look better, but that is likely
>>>>>>>>> largely due to external influences.
>>>>>>>>> 
>>>>>>>>> In any case, I'm going to suggest we go with a default checkpoint
>>>>>>>>> interval of 5 minutes. I've updated the KIP with this.
>>>>>>>>> 
>>>>>>>>> commit every 10 seconds (no checkpoint)
>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>> 10000000/34798/287372.83751939767/29.570664980746017
>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>> 10000000/35942/278226.0308274442/28.62945857214401
>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>> 10000000/34677/288375.58035585546/29.673847218617528
>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>> 10000000/34677/288375.58035585546/29.673847218617528
>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>> 10000000/31192/320595.02436522185/32.98922800718133
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> checkpoint every 10 seconds (same as commit interval)
>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>> 10000000/36997/270292.185852907/27.81306592426413
>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>> 10000000/32087/311652.69423754164/32.069062237043035
>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>> 10000000/32895/303997.5680194558/31.281349749202004
>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>> 10000000/33476/298721.4720994145/30.738439479029754
>>>>>>>>> Streams Performance [records/latency/rec-sec/MB-sec source+store]:
>>>>>>>>> 10000000/33196/301241.1133871551/30.99771056753826
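
As a quick sanity check on the two sets of runs above, the rec-sec figures can be averaged and compared. The numbers are copied from the benchmark output above. The variable names, and the choice of the mean and the min-to-max spread as summary statistics, are assumptions made for illustration.

```python
from statistics import mean

# rec-sec values from the runs with commit every 10 seconds (no checkpoint)
without_checkpoint = [
    287372.83751939767,
    278226.0308274442,
    288375.58035585546,
    288375.58035585546,
    320595.02436522185,
]

# rec-sec values from the runs with checkpoint every 10 seconds
with_checkpoint = [
    270292.185852907,
    311652.69423754164,
    303997.5680194558,
    298721.4720994145,
    301241.1133871551,
]

mean_without = mean(without_checkpoint)
mean_with = mean(with_checkpoint)

# Relative difference between the two means.
relative_change = (mean_with - mean_without) / mean_without

# Run-to-run spread of the baseline, relative to its mean.
spread_without = (max(without_checkpoint) - min(without_checkpoint)) / mean_without

print(f"mean without checkpoint: {mean_without:.0f} rec/sec")
print(f"mean with checkpoint:    {mean_with:.0f} rec/sec")
print(f"relative change:         {relative_change:+.2%}")
print(f"baseline spread:         {spread_without:.2%}")
```

The difference between the means is small compared with the run-to-run spread of the baseline. That fits the reading that checkpointing has little measurable impact and that the apparent improvement is noise.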
>>>>>>>>> 
>>>>>>>>> On Wed, 8 Feb 2017 at 09:02 Damian Guy <[email protected]>
>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Matthias,
>>>>>>>>>> 
>>>>>>>>>> Fair point. I'll update the KIP.
>>>>>>>>>> Thanks
>>>>>>>>>> 
>>>>>>>>>> On Wed, 8 Feb 2017 at 05:49 Matthias J. Sax <
>> [email protected]
>>>> 
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Damian,
>>>>>>>>>> 
>>>>>>>>>> I am not strict about it either. However, if there is no advantage in
>>>>>>>>>> disabling it, we might not want to allow it. This would have the
>>>>>>>>>> advantage of guarding users against accidentally switching it off.
>>>>>>>>>> 
>>>>>>>>>> -Matthias
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 2/3/17 2:03 AM, Damian Guy wrote:
>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>> 
>>>>>>>>>>> It possibly doesn't make sense to disable it, but then I'm sure
>>>>>>>>>>> someone will come up with a reason they don't want it!
>>>>>>>>>>> I'm happy to change it such that the checkpoint interval must be > 0.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Damian
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <
>>> [email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks Damian.
>>>>>>>>>>>> 
>>>>>>>>>>>> One more question: "Checkpointing is disabled if the checkpoint
>>>>>>>>>>>> interval is set to a value <=0."
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Does it make sense to disable checkpointing? What's the tradeoff
>>>>>>>>>>>> here?
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> -Matthias
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 2/2/17 1:51 AM, Damian Guy wrote:
>>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for the comments.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. TBD - I need to do some performance tests and try to work out a
>>>>>>>>>>>>> sensible default.
>>>>>>>>>>>>> 2. Yes, you are correct. It could be a multiple of the
>>>>>>>>>>>>> commit.interval.ms. But that would also mean that if you change the
>>>>>>>>>>>>> commit interval (say you lower it), then you might also need to change
>>>>>>>>>>>>> the checkpoint setting (i.e., you still only want to checkpoint every
>>>>>>>>>>>>> n minutes).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <
>>>>>> [email protected]
>>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for the KIP Damian.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am wondering about two things:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. what should be the default value for the new parameter?
>>>>>>>>>>>>>> 2. why is the new parameter provided in ms?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> About (2): because
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> "the minimum checkpoint interval will be the value of
>>>>>>>>>>>>>> commit.interval.ms. In effect the actual checkpoint interval will be
>>>>>>>>>>>>>> a multiple of the commit interval"
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> it might be easier to just use a parameter that is
>>>>>>>>>>>>>> "number-of-commit-intervals".
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 2/1/17 7:29 AM, Damian Guy wrote:
>>>>>>>>>>>>>>> Thanks for the comments Eno.
>>>>>>>>>>>>>>> As for exactly once, I don't believe this matters as we are just
>>>>>>>>>>>>>>> restoring the change-log, i.e., the result of the aggregations that
>>>>>>>>>>>>>>> previously ran etc. So once initialized the state store will be in the
>>>>>>>>>>>>>>> same state as it was before.
>>>>>>>>>>>>>>> Having the checkpoint in a Kafka topic is not ideal as the state is per
>>>>>>>>>>>>>>> Kafka Streams instance. So each instance would need to start with a
>>>>>>>>>>>>>>> unique id that is persistent.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Damian
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <
>>>>>> [email protected]
>>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> As a follow up to my previous comment, have you thought about writing
>>>>>>>>>>>>>>>> the checkpoint to a topic instead of a local file? That would have the
>>>>>>>>>>>>>>>> advantage that all metadata continues to be managed by Kafka, as well
>>>>>>>>>>>>>>>> as fitting with EoS. The potential disadvantage would be higher
>>>>>>>>>>>>>>>> latency; however, if it is periodic as you mention, I'm not sure that
>>>>>>>>>>>>>>>> would be a show stopper.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Eno
>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <
>>> [email protected]
>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks Damian, this is a good idea and will reduce the restore time.
>>>>>>>>>>>>>>>>> Looking forward, with exactly once and support for transactions in
>>>>>>>>>>>>>>>>> Kafka, I believe we'll have to add some support for rolling back
>>>>>>>>>>>>>>>>> checkpoints, e.g., when a transaction is aborted. We need to be aware
>>>>>>>>>>>>>>>>> of that and ideally anticipate those needs a bit in the KIP.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Eno
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <
>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I would like to start the discussion on KIP-116:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-116+-+Add+State+Store+Checkpoint+Interval+Configuration
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Damian
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> -- Guozhang
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> -- Guozhang
>>> 
>>> 
>> 
>> 
>> --
>> -- Guozhang
>>
