Yes, basically I'm OK with how the join works, including window and
retention periods, under normal circumstances. When events are processed
in real time, as they occur, an application joining the streams will get
something like this:
T1 + 0 => topic_small (K1, V1) => join result (None)
T1 + 1 min => topic_large (K1, VT1) => join result (K1, V1, VT1)
T1 + 3 mins => topic_large (K1, VT2) => join result (K1, V1, VT2)
T1 + 7 mins => topic_small (K1, V2) => join result (K1, V2, VT2)
According to Windowed<K> and WindowedSerializer, only the start of the
window is kept with the key when storing it to the state store. I'm
assuming that the window start time is the same for both topics/KStreams
(not sure yet, still reading the source), but even if it is not, the
state store actions of Kafka Streams will be like this:
join_left_side_store.put ( K1-W1, V1 )
join_right_side_store.put ( K1-W1, VT1 )
join_left_side_store.put ( K1-W1, V2 )
join_right_side_store.put ( K1-W1, VT2 )
However, when the same application consumes the same topics from the
beginning, from scratch (no local state stores), over a large period of
time (greater than the window period, but less than the retention
period), the join result for a 10-minute window will be different, like this:
join result (None)
join result (K1, V2, VT1)
join result (K1, V2, VT2)
Because topic_large's stream is being read more slowly, the value of
topic_small in the window will change from V1 to V2 before Kafka Streams
receives VT1.
That is, the state store actions of Kafka Streams will be like this:
join_left_side_store.put ( K1-W1, V1 )
join_left_side_store.put ( K1-W1, V2 )
join_right_side_store.put ( K1-W1, VT1 )
join_right_side_store.put ( K1-W1, VT2 )
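The ordering dependence above can be sketched with a small simulation (plain Java, not the actual Kafka Streams internals): a single key K1 in a single window W1, where each arriving record overwrites its side's store and is joined against the latest value on the other side. The class and method names are hypothetical.

```java
import java.util.*;

// Hypothetical sketch, not real Kafka Streams code: each event is
// {side, value}; a record overwrites its side's store entry for K1-W1
// and joins against the latest value stored for the other side.
public class ReplayOrderSketch {
    static List<String> join(List<String[]> events) {
        Map<String, String> left = new HashMap<>();   // join_left_side_store
        Map<String, String> right = new HashMap<>();  // join_right_side_store
        List<String> out = new ArrayList<>();
        for (String[] e : events) {
            Map<String, String> mine  = e[0].equals("small") ? left : right;
            Map<String, String> other = e[0].equals("small") ? right : left;
            mine.put("K1-W1", e[1]);
            String o = other.get("K1-W1");
            if (o == null) out.add("None");
            else out.add(e[0].equals("small")
                    ? "(K1, " + e[1] + ", " + o + ")"
                    : "(K1, " + o + ", " + e[1] + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Records in real-time arrival order (as in the first sequence).
        List<String[]> realtime = Arrays.asList(
            new String[]{"small", "V1"}, new String[]{"large", "VT1"},
            new String[]{"large", "VT2"}, new String[]{"small", "V2"});
        // Records in replay order: the small topic is consumed faster.
        List<String[]> replay = Arrays.asList(
            new String[]{"small", "V1"}, new String[]{"small", "V2"},
            new String[]{"large", "VT1"}, new String[]{"large", "VT2"});
        System.out.println(join(realtime));
        System.out.println(join(replay));
    }
}
```

Under these simplified semantics, the real-time ordering yields (K1, V1, VT1) and (K1, V1, VT2) before V2 arrives, while the replay ordering only ever pairs VT1 and VT2 with V2.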
Isn't that so?
On Wed, Apr 26, 2017 at 6:50 PM, Damian Guy <damian....@gmail.com>
wrote:
Hi Murad,
On Wed, 26 Apr 2017 at 13:37 Murad Mamedov <m...@muradm.net> wrote:
Is there any global time synchronization between streams in the Kafka
Streams API? So that it would not consume more events from one stream
while the other is still behind in time. Or, probably better to rephrase
it: is there global event ordering based on the timestamp of the event?
Yes. When streams are joined, the corresponding partitions from the
joined streams are grouped together into a single Task. Each Task
maintains a record buffer for all of the topics it is consuming from.
When it is time to process a record, it will choose the record from the
partition that has the smallest timestamp. In this way it makes a best
effort to keep the streams in sync.
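That best-effort selection can be sketched as a merge that always takes the buffered record with the smallest timestamp (a simplified model, not the actual Task implementation; each partition buffer is reduced to a queue of record timestamps):

```java
import java.util.*;

// Simplified sketch of picking the next record to process: among the
// non-empty partition buffers, take the head record with the smallest
// timestamp. This keeps the interleaved streams roughly in time order.
public class TimestampMergeSketch {
    static List<Long> processOrder(List<Deque<Long>> partitions) {
        List<Long> order = new ArrayList<>();
        while (true) {
            Deque<Long> best = null;
            for (Deque<Long> p : partitions) {
                if (!p.isEmpty() && (best == null || p.peek() < best.peek())) {
                    best = p;  // partition whose head has the smallest timestamp
                }
            }
            if (best == null) break;  // all buffers drained
            order.add(best.poll());
        }
        return order;
    }

    public static void main(String[] args) {
        // topic_small events at T1+0 and T1+7 min; topic_large at T1+1 and T1+3 min.
        Deque<Long> small = new ArrayDeque<>(Arrays.asList(0L, 7L));
        Deque<Long> large = new ArrayDeque<>(Arrays.asList(1L, 3L));
        System.out.println(processOrder(Arrays.asList(small, large)));
    }
}
```

Note this only works while both buffers actually contain records; when one stream's buffer runs far ahead or empty, the synchronization is best effort only, which is what makes the replay scenario above possible.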
The other thing could be to join the streams in a window; however, the
same question arises: if one stream is days behind the other, will a
join window of 15 minutes ever work?
If the data is arriving much later, you can use
JoinWindows.until(SOME_TIME_PERIOD) to keep the data around. In this
case the streams will still join. Once SOME_TIME_PERIOD has expired, the
streams will no longer be able to join.
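To illustrate the two conditions at play, here is a hypothetical helper (not the real Kafka Streams API): two records can join only if their timestamps fall within the join window of each other, and only while stream time has not advanced past the retention period set via until().

```java
// Hypothetical sketch of windowed-join retention semantics, not real
// Kafka Streams code: inWindow models the 15-minute join window,
// retained models the until() retention period.
public class RetentionSketch {
    static boolean canJoin(long tsLeft, long tsRight, long windowMs,
                           long streamTimeMs, long retentionMs) {
        boolean inWindow = Math.abs(tsLeft - tsRight) <= windowMs;
        boolean retained = streamTimeMs - Math.min(tsLeft, tsRight) <= retentionMs;
        return inWindow && retained;
    }

    public static void main(String[] args) {
        long min = 60_000L;
        // 15-minute window, 1-day retention: a record arriving a day
        // late can still join data that falls within the window...
        System.out.println(canJoin(0, 10 * min, 15 * min, 24 * 60 * min, 24 * 60 * min));
        // ...but once stream time passes the retention period, it cannot.
        System.out.println(canJoin(0, 10 * min, 15 * min, 25 * 60 * min, 24 * 60 * min));
    }
}
```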
I'm trying to figure out how to design a replay of long periods of time
for an application with multiple topics/streams, especially when
combining them with low-level API processors and transformers which rely
on each other via GlobalKTable or KTable stores on these streams. For
instance, the smaller topic could have the following sequence of events:
T1 - (k1, v1)
T1 + 10 minutes - (k1, null)
T1 + 20 minutes - (k1, v2)
While the larger topic could have:
T1 - (k1, vt1)
T1 + 5 minutes - (k1, null)
T1 + 15 minutes - (k1, vt2)
If one were to join or look up these streams in real time (the timestamp
of an event is approximately equal to wall-clock time), the result would be:
T1 - topic_small (k1, v1) - topic_large (k1, vt1)
T1 + 5 minutes - topic_small (k1, v1) - topic_large (k1, null)
T1 + 10 minutes - topic_small (k1, null) - topic_large (k1, null)
T1 + 15 minutes - topic_small (k1, null) - topic_large (k1, vt2)
T1 + 20 minutes - topic_small (k1, v2) - topic_large (k1, vt2)
However, when replaying the streams from the beginning, from the
perspective of the larger topic, it would see the smaller topic as
(k1, v2), completely missing the v1 and null states in the case of a
GlobalKTable/KTable presentation, or the corresponding events in the
case of a KStream-KStream windowed join.
I don't really follow here. In the case of a GlobalKTable it will be
initialized with all of the existing data before the rest of the streams
start processing.
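The table side of this can be sketched with a KTable-style changelog apply, where a null value acts as a tombstone deleting the key (a hypothetical simulation, not Kafka Streams code). Replaying the whole changelog first leaves only the latest value visible, which is exactly why a stream that is far behind never observes v1 or the intermediate null state.

```java
import java.util.*;

// Hypothetical sketch of applying a changelog the way a KTable does:
// a null value is a tombstone that deletes the key; later values
// overwrite earlier ones, so only the latest state survives a replay.
public class ChangelogReplaySketch {
    static Map<String, String> apply(List<String[]> changelog) {
        Map<String, String> table = new HashMap<>();
        for (String[] kv : changelog) {
            if (kv[1] == null) table.remove(kv[0]);  // tombstone
            else table.put(kv[0], kv[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> t = apply(Arrays.asList(
            new String[]{"k1", "v1"},    // T1
            new String[]{"k1", null},    // T1 + 10 minutes (tombstone)
            new String[]{"k1", "v2"}));  // T1 + 20 minutes
        System.out.println(t);  // only the latest state survives
    }
}
```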
Am I missing something here? Should the application be responsible for
global synchronization between topics, or does / can Kafka Streams do
that? If the application should, then what could be an approach to solve it?
I hope I have explained myself clearly.
Thanks in advance