[jira] [Commented] (KAFKA-4113) Allow KTable bootstrap

Greg Fodor (JIRA) Wed, 19 Oct 2016 01:06:06 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587991#comment-15587991
 ]


Greg Fodor commented on KAFKA-4113:
-----------------------------------

I guess what I would argue is that KStreamBuilder#table should have identical 
semantics to a KTable backed by a logged state store, except that you specify 
the topic and (obviously) it is not mutable from the job's POV. It should first 
check whether it has a local, checkpointed RocksDB store, and if so, it should 
just read from the checkpoint forward. If not, it should rematerialize from 
offset 0 and block the start of the job until it has caught up. On shutdown, it 
should write the checkpoint file. It seems to me that this might boil down to 
just having it be "use this topic for the logged state store backing this 
KTableImpl."
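
To make the proposed startup/shutdown behavior concrete, here is a minimal 
sketch in plain Java. It is not Kafka Streams code: BootstrapSketch, Rec, 
restore and shutdown are hypothetical names, the HashMap stands in for the 
local RocksDB store, the Long field stands in for the checkpoint file, and the 
List stands in for one partition of the source topic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Kafka Streams.
class BootstrapSketch {
    // One record of the source topic; a null value is treated as a delete.
    static final class Rec {
        final String key;
        final String value;
        Rec(String key, String value) { this.key = key; this.value = value; }
    }

    final Map<String, String> store = new HashMap<>(); // stands in for local RocksDB
    Long checkpoint = null;                            // stands in for the checkpoint file
    private long position = 0L;                        // next offset to read

    // Runs before the job starts; returns how many records were replayed.
    long restore(List<Rec> topic) {
        long from;
        if (checkpoint != null) {
            from = checkpoint;   // local state is valid: read from the checkpoint forward
        } else {
            store.clear();       // no checkpoint: rematerialize from offset 0
            from = 0L;
        }
        long end = topic.size(); // end offset as seen at startup
        for (long offset = from; offset < end; offset++) {
            Rec r = topic.get((int) offset);
            if (r.value == null) store.remove(r.key);
            else store.put(r.key, r.value);
        }
        position = end;          // only now would the job be allowed to start
        return end - from;
    }

    // On clean shutdown, record the offset to resume from.
    void shutdown() { checkpoint = position; }
}
```

With a checkpoint present, a restart only replays the records written since 
the last clean shutdown; without one, it pays the full replay from offset 0.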

I'm sure there are cases I'm missing, but having that be the behavior for 
KStreamBuilder#table would effectively solve all of our problems as far as I 
can tell. The semantics and I/O impact of this approach come out to exactly the 
same ones you get with a normal user-created persistent state store, except 
that you are managing the topic writes yourself.

> Allow KTable bootstrap
> ----------------------
>
>                 Key: KAFKA-4113
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4113
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: streams
>            Reporter: Matthias J. Sax
>            Assignee: Guozhang Wang
>
> On the mailing list, there are multiple requests about the possibility to 
> "fully populate" a KTable before actual stream processing starts.
> Even if it is somewhat difficult to define when the initial populating phase 
> should end, there are multiple possibilities:
> The main idea is that there is a rarely updated topic that contains the 
> data. Only after this topic has been read completely and the KTable is ready 
> should the application start processing. This would imply that on startup 
> the current partition sizes must be fetched and stored, and after the KTable 
> has been populated up to those offsets, stream processing can start.
> Other discussed ideas are:
> 1) an initial fixed time period for populating
> (it might be hard for a user to estimate the correct value)
> 2) an "idle" period, i.e., if no update to a KTable is made for a certain
> time, we consider it populated
> 3) a timestamp cut-off point, i.e., all records with an older timestamp
> belong to the initial populating phase
> The API change is not decided yet, and the API design is part of this JIRA.
> One suggestion (for option (4)) was:
> {noformat}
> // populate the table without reading any other topics until a record
> // with timestamp 1000 is seen
> KTable table = builder.table("topic", 1000);
> {noformat}
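
The "main idea" in the description above (capture the partition end offsets 
at startup, then hold back stream processing until the table has been 
populated up to them) could look roughly like the sketch below. BootstrapGate 
and isPopulated are hypothetical names, not part of any Kafka API, and the maps 
are keyed by partition number only to keep the example self-contained. In a 
real application the startup snapshot could, for example, come from the 
consumer's endOffsets call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not an existing Kafka Streams class.
class BootstrapGate {
    // partition -> end offset captured once at startup
    private final Map<Integer, Long> targets;

    BootstrapGate(Map<Integer, Long> endOffsetsAtStartup) {
        // Copy the map so that records appended later do not move the goalposts.
        this.targets = new HashMap<>(endOffsetsAtStartup);
    }

    // positions: partition -> next offset the table restore would read.
    // Returns true once every partition has reached its startup end offset.
    boolean isPopulated(Map<Integer, Long> positions) {
        for (Map.Entry<Integer, Long> e : targets.entrySet()) {
            long pos = positions.getOrDefault(e.getKey(), 0L);
            if (pos < e.getValue()) {
                return false; // this partition is still behind its target
            }
        }
        return true;
    }
}
```

Because the targets are fixed at startup, the populating phase has a 
well-defined end even if the topic keeps receiving occasional updates.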



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
