Makes sense, thanks.
Eno
> On 6 Feb 2017, at 15:01, Damian Guy <damian....@gmail.com> wrote:
>
> Hi Eno,
>
> The state is on local disk, so having the checkpoint in a topic won't help.
> If the host fails permanently, then all of the local state is gone.
> Starting on another host requires restoring from the earliest offset.
>
> Thanks,
> Damian
>
> On Mon, 6 Feb 2017 at 14:58 Eno Thereska <eno.there...@gmail.com> wrote:
>
>> Hi Damian,
>>
>> I am trying to figure out if this handles a common enough failure
>> scenario. It seems to me this handles transient failures: a server with an
>> instance fails, then comes back up shortly and the instance recovers
>> quickly by reading the checkpoint file.
>>
>> Permanent failures, where the server fails and the instance is migrated
>> onto another server, are not helped, since the checkpoint file is lost
>> with the failed server. Even if the server eventually comes back up (a
>> transient failure, but one where the instance has already migrated), the
>> local checkpoint file doesn't help, because the instance now runs on
>> another server.
>>
>> I was thinking a topic-based implementation would handle all scenarios.
>> What am I missing? I'm basically worried that the file-based implementation
>> addresses a niche problem, but can be convinced otherwise.
>>
>> Thanks
>> Eno
>>
>>
>>
>>> On 3 Feb 2017, at 10:03, Damian Guy <damian....@gmail.com> wrote:
>>>
>>> Hi Matthias,
>>>
>>> It possibly doesn't make sense to disable it, but then I'm sure someone
>>> will come up with a reason they don't want it!
>>> I'm happy to change it such that the checkpoint interval must be > 0.
>>>
>>> Cheers,
>>> Damian
>>>
>>> On Fri, 3 Feb 2017 at 01:29 Matthias J. Sax <matth...@confluent.io> wrote:
>>>
>>>> Thanks Damian.
>>>>
>>>> One more question: "Checkpointing is disabled if the checkpoint interval
>>>> is set to a value <=0."
>>>>
>>>>
>>>> Does it make sense to disable checkpointing? What's the tradeoff here?
>>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>> On 2/2/17 1:51 AM, Damian Guy wrote:
>>>>> Hi Matthias,
>>>>>
>>>>> Thanks for the comments.
>>>>>
>>>>> 1. TBD - I need to do some performance tests and try to work out a
>>>>> sensible default.
>>>>> 2. Yes, you are correct. It could be a multiple of the
>>>>> commit.interval.ms. But that would also mean that if you change the
>>>>> commit interval (say you lower it), you might also need to change the
>>>>> checkpoint setting (i.e., you still only want to checkpoint every n
>>>>> minutes).
>>>>>
>>>>> On Wed, 1 Feb 2017 at 23:46 Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>
>>>>>> Thanks for the KIP Damian.
>>>>>>
>>>>>> I am wondering about two things:
>>>>>>
>>>>>> 1. what should be the default value for the new parameter?
>>>>>> 2. why is the new parameter provided in ms?
>>>>>>
>>>>>> About (2): because
>>>>>>
>>>>>> "the minimum checkpoint interval will be the value of
>>>>>> commit.interval.ms. In effect the actual checkpoint interval will be a
>>>>>> multiple of the commit interval"
>>>>>>
>>>>>> it might be easier to just use a parameter that is "number of commit
>>>>>> intervals".
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>>> On 2/1/17 7:29 AM, Damian Guy wrote:
>>>>>>> Thanks for the comments Eno.
>>>>>>> As for exactly once, I don't believe this matters as we are just
>>>>>>> restoring the change-log, i.e., the result of the aggregations that
>>>>>>> previously ran etc. So once initialized the state store will be in the
>>>>>>> same state as it was before.
>>>>>>> Having the checkpoint in a Kafka topic is not ideal as the state is per
>>>>>>> Kafka Streams instance. So each instance would need to start with a
>>>>>>> unique id that is persistent.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Damian
>>>>>>>
>>>>>>> On Wed, 1 Feb 2017 at 13:20 Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>>
>>>>>>>> As a follow up to my previous comment, have you thought about writing
>>>>>>>> the checkpoint to a topic instead of a local file? That would have the
>>>>>>>> advantage that all metadata continues to be managed by Kafka, as well
>>>>>>>> as fit with EoS. The potential disadvantage would be higher latency;
>>>>>>>> however, if it is periodic as you mention, I'm not sure that would be
>>>>>>>> a showstopper.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Eno
>>>>>>>>> On 1 Feb 2017, at 12:58, Eno Thereska <eno.there...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Thanks Damian, this is a good idea and will reduce the restore time.
>>>>>>>>> Looking forward, with exactly once and support for transactions in
>>>>>>>>> Kafka, I believe we'll have to add some support for rolling back
>>>>>>>>> checkpoints, e.g., when a transaction is aborted. We need to be aware
>>>>>>>>> of that and ideally anticipate those needs a bit in the KIP.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Eno
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 1 Feb 2017, at 10:18, Damian Guy <damian....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I would like to start the discussion on KIP-116:
>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-116+-+Add+State+Store+Checkpoint+Interval+Configuration
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Damian
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>
>>
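
For reference, the point quoted earlier in the thread (that the minimum checkpoint interval is commit.interval.ms and that the actual interval ends up being a multiple of it) can be illustrated with a small sketch. This is not code from KIP-116 or from Kafka Streams. The class name, method name and DISABLED sentinel are hypothetical, and it assumes a checkpoint is only ever written at a commit, on the first commit at or after the configured interval has elapsed.

```java
public class CheckpointIntervalSketch {

    // Hypothetical sentinel (not part of KIP-116) meaning "checkpointing disabled".
    static final long DISABLED = -1L;

    /**
     * Rounds the configured checkpoint interval up to the next multiple of the
     * commit interval, on the assumption that checkpoints can only be written
     * when a commit happens. A configured value <= 0 is treated as "disabled",
     * matching the wording quoted from the KIP in this thread.
     */
    static long effectiveCheckpointIntervalMs(long checkpointIntervalMs, long commitIntervalMs) {
        if (commitIntervalMs <= 0) {
            throw new IllegalArgumentException("commit interval must be > 0");
        }
        if (checkpointIntervalMs <= 0) {
            return DISABLED;
        }
        // Ceiling division: number of commits needed to cover the checkpoint interval.
        long commits = (checkpointIntervalMs + commitIntervalMs - 1) / commitIntervalMs;
        return commits * commitIntervalMs;
    }

    public static void main(String[] args) {
        long commitMs = 30_000L;
        System.out.println(effectiveCheckpointIntervalMs(60_000L, commitMs)); // exact multiple
        System.out.println(effectiveCheckpointIntervalMs(45_000L, commitMs)); // rounded up
        System.out.println(effectiveCheckpointIntervalMs(10_000L, commitMs)); // clamped to one commit
        System.out.println(effectiveCheckpointIntervalMs(0L, commitMs));      // disabled
    }
}
```

Under these assumptions, lowering the commit interval changes how a fixed millisecond setting rounds, whereas a "number of commit intervals" setting would change the wall-clock checkpoint period directly, which is the tradeoff discussed above.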