Hi,

Suppose we have two topics, one whose events are 100 bytes each and the
other whose events are 5000 bytes each. Producers have been writing
timestamped events to both topics for 3 days, the same number of events in
each; let's assume 100000 events per topic.

Plain Kafka Client API consumers will obviously take different amounts of
time to consume the two topics.

Now, after 3 days, an application starts consuming these topics as
KStreams via the high-level API.

If we need to join these two streams on a time basis, say with a 15-minute
window, how would that behave?

Obviously, the stream with smaller events will be consumed faster than the
stream with larger events. Stream100 could already be reading data from
1 day ago while Stream5000 is still reading events from 3 days ago.

Is there any global time synchronization between streams in the Kafka
Streams API, so that it would not consume more events from one stream
while the other is still behind in time? Or, to rephrase it: is there a
global event ordering based on the timestamps of the events?
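For context on what such synchronization could look like: a policy of always
taking the next record from whichever input currently has the smallest head
timestamp keeps two streams roughly aligned in event time. Below is a minimal
stdlib-only sketch of that policy (the class, record, and data are
illustrative, not Kafka code), which mirrors the best-effort, per-task
timestamp synchronization that Kafka Streams describes for tasks with
multiple input partitions:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TimestampAlignedMerge {
    record Event(String topic, long ts, String value) {}

    // Pick the next event from whichever buffer has the smaller head
    // timestamp, so neither input runs ahead of the other in event time.
    static List<Event> merge(Deque<Event> small, Deque<Event> large) {
        List<Event> out = new ArrayList<>();
        while (!small.isEmpty() || !large.isEmpty()) {
            if (large.isEmpty()
                    || (!small.isEmpty() && small.peek().ts() <= large.peek().ts())) {
                out.add(small.poll());
            } else {
                out.add(large.poll());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Deque<Event> small = new ArrayDeque<>(List.of(
                new Event("topic_small", 0, "v1"),
                new Event("topic_small", 10, "null"),
                new Event("topic_small", 20, "v2")));
        Deque<Event> large = new ArrayDeque<>(List.of(
                new Event("topic_large", 0, "vt1"),
                new Event("topic_large", 5, "null"),
                new Event("topic_large", 15, "vt2")));
        // Events come out interleaved in timestamp order, even though each
        // input would be consumed at a very different rate on its own.
        for (Event e : merge(small, large)) {
            System.out.println(e.ts() + " " + e.topic() + " " + e.value());
        }
    }
}
```

Note this policy can only choose among records that are already buffered; if
one input's buffer is empty, a real runtime must decide whether to wait or to
process the other input anyway.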

The other option could be a windowed stream join, but the same question
arises: if one stream is days behind the other, will a 15-minute join
window ever work?
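One relevant point: the join window is defined on event timestamps, not on
wall-clock arrival time, so a replayed join can in principle produce the same
pairs as a live one. Here is a minimal stdlib-only sketch of that semantics
(a symmetric "within 15 minutes, same key" join; the names and data are
illustrative, and a real Kafka Streams join would additionally depend on the
runtime keeping both inputs buffered within window retention):

```java
import java.util.ArrayList;
import java.util.List;

public class EventTimeJoin {
    record Rec(String key, long tsMinutes, String value) {}
    record Pair(String left, String right) {}

    // Pair every left/right record with the same key whose event timestamps
    // are at most `windowMinutes` apart; arrival order is irrelevant.
    static List<Pair> join(List<Rec> left, List<Rec> right, long windowMinutes) {
        List<Pair> out = new ArrayList<>();
        for (Rec l : left) {
            for (Rec r : right) {
                if (l.key().equals(r.key())
                        && Math.abs(l.tsMinutes() - r.tsMinutes()) <= windowMinutes) {
                    out.add(new Pair(l.value(), r.value()));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> small = List.of(new Rec("k1", 0, "v1"), new Rec("k1", 20, "v2"));
        List<Rec> large = List.of(new Rec("k1", 0, "vt1"), new Rec("k1", 15, "vt2"));
        // With a 15-minute window: v1 joins vt1 and vt2; v2 joins vt2 only.
        join(small, large, 15).forEach(p ->
                System.out.println(p.left() + " <-> " + p.right()));
    }
}
```

Whether a runtime actually emits these pairs during a replay comes back to the
synchronization question above: it must not let one input run days ahead in
event time while the other's matching records have not been read yet.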

I'm trying to figure out how to design replay of long time periods for an
application with multiple topics/streams, especially when combining
low-level API processors and transformers that rely on each other via
GlobalKTable or KTable stores built on these streams. For instance, the
smaller topic could have the following sequence of events:

T1 - (k1, v1)
T1 + 10 minutes - (k1, null)
T1 + 20 minutes - (k1, v2)

While the topic with larger events has:

T1 - (k1, vt1)
T1 + 5 minutes - (k1, null)
T1 + 15 minutes - (k1, vt2)

If one joined or looked up these streams in real time (event timestamp
approximately equal to wall-clock time), the result would be:

T1 - topic_small (k1, v1) - topic_large (k1, vt1)
T1 + 5 minutes - topic_small (k1, v1) - topic_large (k1, null)
T1 + 10 minutes - topic_small (k1, null) - topic_large (k1, null)
T1 + 15 minutes - topic_small (k1, null) - topic_large (k1, vt2)
T1 + 20 minutes - topic_small (k1, v2) - topic_large (k1, vt2)

However, when replaying the streams from the beginning, the topic with
large events would see the topic with small events as (k1, v2), completely
missing the v1 and null states (in the GlobalKTable/KTable case) or events
(in the KStream-KStream windowed-join case).

Am I missing something here? Should the application be responsible for
global synchronization between topics, or does (or can) Kafka Streams do
that? If the application should, what could be an approach to solving it?
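One knob that may be worth checking, assuming a Kafka Streams version that
includes KIP-353 (2.1 or later): the `max.task.idle.ms` config makes a task
wait up to that long for data on an empty input partition before processing
the others, which tightens the best-effort timestamp synchronization during
replay. A possible properties fragment (the value is illustrative):

```
# Wait up to 5 s for all input partitions of a task to have buffered data
# before picking the next record by smallest timestamp.
max.task.idle.ms=5000
```

Note that this applies to joins over regular streams and tables; a
GlobalKTable is bootstrapped and maintained independently of stream time, so
it would not be covered by this setting.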

I hope I've explained myself clearly.

Thanks in advance
