Understanding the topology of high level kafka stream

Sachin Mittal Wed, 09 Nov 2016 06:01:42 -0800

Hi,
I have some basic questions about the sequence of tasks when a streaming
application restarts, whether after a failure or otherwise.


Say my stream is structured this way:

source-topic
   branched into 2 KStreams
       source-topic-1
       source-topic-2
   each mapped to a new KStream (new key,value pairs) backed by a Kafka topic
       source-topic-1-new
       source-topic-2-new
   each aggregated into a new KTable backed by an internal changelog topic
       source-topic-1-new-table (source-topic-1-new-changelog)
       source-topic-2-new-table (source-topic-2-new-changelog)
   table1 left join table2 -> to final stream
The results of the final stream are then persisted into another data store.
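
To make the shape of this topology concrete, here is a minimal in-memory
sketch in plain Java. It does not use the Kafka Streams API and is not the
actual application code. The branching rule (even vs. odd values), the
re-keying rule (first letter of the key), the sum aggregation, the sample
records, and all class and method names are assumptions made only for
illustration. The join is computed once over the finished tables, which is a
simplification of how a table-table join would update continuously.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopologySketch {

    /** One key/value record, as it would sit on a topic (hypothetical type). */
    static final class Rec {
        final String key;
        final int value;
        Rec(String key, int value) { this.key = key; this.value = value; }
    }

    /** Branch + map: keep records of the wanted parity and re-key them (assumed rules). */
    static List<Rec> branchAndMap(List<Rec> source, boolean wantEven) {
        List<Rec> out = new ArrayList<>();
        for (Rec r : source) {
            boolean even = r.value % 2 == 0;
            if (even == wantEven) {
                // New key = first letter of the old key (keys assumed non-empty).
                out.add(new Rec(r.key.substring(0, 1), r.value));
            }
        }
        return out;
    }

    /** Aggregate: running sum per key; each update is also appended to the changelog. */
    static Map<String, Integer> aggregate(List<Rec> stream, List<Rec> changelog) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Rec r : stream) {
            int updated = table.merge(r.key, r.value, Integer::sum);
            changelog.add(new Rec(r.key, updated));
        }
        return table;
    }

    /** Left join: every key of the left table appears; the right value may be null. */
    static Map<String, String> leftJoin(Map<String, Integer> left, Map<String, Integer> right) {
        Map<String, String> joined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : left.entrySet()) {
            joined.put(e.getKey(), e.getValue() + "|" + right.get(e.getKey()));
        }
        return joined;
    }

    /** Runs the whole pipeline over a tiny hard-coded stand-in for source-topic. */
    static Map<String, String> run() {
        List<Rec> source = List.of(
            new Rec("a1", 2), new Rec("a2", 4), new Rec("b1", 3),
            new Rec("a3", 5), new Rec("c1", 6));

        List<Rec> topic1New = branchAndMap(source, true);   // source-topic-1-new
        List<Rec> topic2New = branchAndMap(source, false);  // source-topic-2-new

        List<Rec> changelog1 = new ArrayList<>();           // source-topic-1-new-changelog
        List<Rec> changelog2 = new ArrayList<>();           // source-topic-2-new-changelog
        Map<String, Integer> table1 = aggregate(topic1New, changelog1);
        Map<String, Integer> table2 = aggregate(topic2New, changelog2);

        return leftJoin(table1, table2);                    // final stream
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Note that in this sketch the changelog holds one entry per update, so the same
key can appear several times with successively newer aggregate values.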

So, as you can see, I have the following physical topics or state stores:
source-topic
source-topic-1-new
source-topic-2-new
source-topic-1-new-changelog
source-topic-2-new-changelog

Now, at a given point, if the streaming application is stopped, there is
some data in all of these topics.
Apart from source-topic, all the other topics contain data written by the
streaming application.

Also, I suppose the streaming application stores, for each topic, the
offset where it last left off.

So when I restart the application, how does processing start again?
Will it pick up the data in the changelog topics from where it left off,
process that first, and then process the source topic data from the offset
where it left off?

Or will it start from the source topic? I really don't want it to maintain
offsets into the changelog topics, because the value of any old key can be
modified again as part of the aggregation.

I'm a bit confused here; any light you can shed would help a lot.

Thanks
Sachin
