Hi, suppose we have two topics: one whose events are 100 bytes each, and one whose events are 5000 bytes each. Producers write events (with timestamps) to both topics for 3 days, the same number of events in each, let's assume 100,000 events per topic.
Plain Kafka Client API consumers will obviously take different amounts of time to consume the two topics. Now, after the 3 days, an application starts consuming these topics through the high-level API as KStreams. If we need to join these two streams on a time basis, say with a 15-minute window, how would that behave? Obviously, the stream with smaller events will be consumed faster than the stream with larger events: Stream100 could already be reading data from 1 day ago while Stream5000 is still reading events from 3 days ago.

Is there any global time synchronization between streams in the Kafka Streams API, so that it would not consume more events from one stream while the other is still behind in time? Or, perhaps better phrased: is there global event ordering based on the timestamps of the events?

The other option would be to join the streams in a window, but the same question arises: if one stream is days behind the other, will the 15-minute join window ever produce results?

I'm trying to work out how to design a replay of long periods of time for an application with multiple topics/streams, especially when combining them with low-level API Processors and Transformers that depend on each other via GlobalKTable or KTable stores built on these streams.
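For concreteness, the windowed join I have in mind looks roughly like the sketch below (topic names, the String serdes, and the value-joiner are placeholders; building the topology does not need a broker):

```java
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;

public class JoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Placeholder topic names for the two streams from the question.
        KStream<String, String> small = builder.stream("topic_small");
        KStream<String, String> large = builder.stream("topic_large");

        // Join events whose timestamps are at most 15 minutes apart.
        small.join(
                large,
                (v1, v2) -> v1 + "/" + v2,
                JoinWindows.of(Duration.ofMinutes(15)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
             .to("joined_output");

        // Print the topology; no broker connection is made here.
        System.out.println(builder.build().describe());
    }
}
```

This is only a topology fragment to anchor the question; the question is what happens to that 15-minute window when the two input streams are consumed at very different rates during replay.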
For instance, the smaller topic could have the following sequence of events:

T1              - (k1, v1)
T1 + 10 minutes - (k1, null)
T1 + 20 minutes - (k1, v2)

while the topic with larger events has:

T1              - (k1, vt1)
T1 + 5 minutes  - (k1, null)
T1 + 15 minutes - (k1, vt2)

If one joined or looked up these streams in real time (timestamp of event approximately equal to wall-clock time), the result would be:

T1              - topic_small (k1, v1)   - topic_large (k1, vt1)
T1 + 5 minutes  - topic_small (k1, v1)   - topic_large (k1, null)
T1 + 10 minutes - topic_small (k1, null) - topic_large (k1, null)
T1 + 15 minutes - topic_small (k1, null) - topic_large (k1, vt2)
T1 + 20 minutes - topic_small (k1, v2)   - topic_large (k1, vt2)

However, when replaying the streams from the beginning, from the perspective of the topic with large events, the topic with small events would appear only as (k1, v2), completely missing the v1 and null states (in the GlobalKTable/KTable case) or the corresponding events (in the KStream-KStream windowed-join case).

Am I missing something here? Should the application be responsible for global synchronization between topics, or does (or can) Kafka Streams do that? If the application should, what would be an approach to solve it? I hope I explained myself clearly. Thanks in advance.
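To illustrate the kind of synchronization I'm asking about, here is a plain-Java sketch (it does not use Kafka at all; Event and mergeByTimestamp are made-up names) that always emits the buffered event with the smallest timestamp next. Applied to the two sequences above, it reproduces the real-time interleaving rather than the skewed replay order:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal stand-in for a consumed record (hypothetical; not a Kafka class).
// ts is the event timestamp in minutes after T1.
record Event(String topic, long ts, String value) {}

public class TimestampMerge {

    // Merge two per-topic buffers by always emitting the head event with the
    // smallest timestamp; on a tie, the first buffer wins.
    static List<Event> mergeByTimestamp(Deque<Event> a, Deque<Event> b) {
        List<Event> out = new ArrayList<>();
        while (!a.isEmpty() || !b.isEmpty()) {
            if (b.isEmpty() || (!a.isEmpty() && a.peek().ts() <= b.peek().ts())) {
                out.add(a.poll());
            } else {
                out.add(b.poll());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Deque<Event> small = new ArrayDeque<>(List.of(
                new Event("topic_small", 0, "v1"),
                new Event("topic_small", 10, "null"),
                new Event("topic_small", 20, "v2")));
        Deque<Event> large = new ArrayDeque<>(List.of(
                new Event("topic_large", 0, "vt1"),
                new Event("topic_large", 5, "null"),
                new Event("topic_large", 15, "vt2")));

        for (Event e : mergeByTimestamp(small, large)) {
            System.out.println("T1 + " + e.ts() + "m " + e.topic() + " " + e.value());
        }
        // Prints the events interleaved by timestamp:
        // T1 + 0m topic_small v1
        // T1 + 0m topic_large vt1
        // T1 + 5m topic_large null
        // T1 + 10m topic_small null
        // T1 + 15m topic_large vt2
        // T1 + 20m topic_small v2
    }
}
```

In other words, my question is whether Kafka Streams itself picks the next record this way across input streams during replay, or whether the application has to build something like this on its own.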