Hi Murad,

On Wed, 26 Apr 2017 at 13:37 Murad Mamedov <m...@muradm.net> wrote:
> Is there any global time synchronization between streams in Kafka Streams
> API? So that, it would not consume more events from one stream while the
> other is still behind in time. Or probably better to rephrase it like, is
> there global event ordering based on timestamp of event?
>

Yes. When streams are joined, the corresponding partitions from the joined
streams are grouped together into a single Task. Each Task maintains a
record buffer for all of the topics it is consuming from. When it is time
to process a record, it will choose the record from the partition that has
the smallest timestamp. In this way it makes a best effort to keep the
streams in sync.

> The other thing could be to join streams in window, however same question
> arises, if one stream days behind the other, will the join window of 15
> minutes ever work?
>

If the data is arriving much later you can use
JoinWindows.until(SOME_TIME_PERIOD) to keep the data around. In this case
the streams will still join. Once SOME_TIME_PERIOD has expired the streams
will no longer be able to join. (There is a rough sketch of this at the
end of this mail.)

> I'm trying to grasp a way on how to design replay of long periods of time
> for application with multiple topics/streams. Especially when combining
> with low-level API processors and transformers which relay on each other
> via GlobalKTable or KTable stores on these streams. For instance, smaller
> topic could have the following sequence of events:
>
> T1 - (k1, v1)
> T1 + 10 minutes - (k1, null)
> T1 + 20 minutes - (k1, v2)
>
> While topic with larger events:
>
> T1 - (k1, vt1)
> T1 + 5 minutes - (k1, null)
> T1 + 15 minutes - (k1, vt2)
>
> If one would join or lookup these streams in realtime (timestamp of event
> is approximately = wall clock time) result would be:
>
> T1 - topic_small (k1, v1) - topic_large (k1, vt1)
> T1 + 5 minutes - topic_small (k1, v1) - topic_large (k1, null)
> T1 + 10 minutes - topic_small (k1, null) - topic_large (k1, null)
> T1 + 15 minutes - topic_small (k1, null) - topic_large (k1, vt2)
> T1 + 20 minutes - topic_small (k1, v2) - topic_large (k1, vt2)
>
> However, when replaying streams from beginning, from perspective of topic
> with large events, it would see topic with small events as (k1, v2),
> completely missing v1 and null states in case of GlobalKTable/KTable
> presentation or events in case of KStream-KStream windowed join.
>

I don't really follow here. In the case of a GlobalKTable it will be
initialized with all of the existing data before the rest of the streams
start processing. (See the second sketch at the end of this mail.)

> Do I miss something here? Should application be responsible in global
> synchronization between topics, or Kafka Streams does / can do that? If
> application should, then what could be approach to solve it?
>
> I hope I could explain myself.
>
> Thanks in advance
>
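For reference, here is a rough sketch of the windowed join with an extended
retention mentioned above. It is written against the 0.10.2-era DSL, and
the topic names, serdes, ValueJoiner and the 2-day retention are made-up
placeholders, not something from your application:

import java.util.concurrent.TimeUnit;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

KStreamBuilder builder = new KStreamBuilder();

// Placeholder topics standing in for your "small" and "large" streams.
KStream<String, String> small =
    builder.stream(Serdes.String(), Serdes.String(), "topic_small");
KStream<String, String> large =
    builder.stream(Serdes.String(), Serdes.String(), "topic_large");

// 15 minute join window, but retain the window state for 2 days so that
// records that arrive (or are replayed) much later can still find a match,
// as long as their timestamps are within 15 minutes of each other.
KStream<String, String> joined = small.join(
    large,
    (v, vt) -> v + "," + vt,                            // ValueJoiner
    JoinWindows.of(TimeUnit.MINUTES.toMillis(15))
               .until(TimeUnit.DAYS.toMillis(2)),
    Serdes.String(), Serdes.String(), Serdes.String());

joined.to(Serdes.String(), Serdes.String(), "joined-output");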
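And a second sketch showing the KStream-GlobalKTable lookup join for the
replay case. Again purely illustrative: topic names, the store name and the
joiners are placeholders, and it assumes the 0.10.2 global-table API:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

KStreamBuilder builder = new KStreamBuilder();

// The smaller topic materialized as a GlobalKTable. It is bootstrapped to
// the end of the topic before the rest of the topology starts processing,
// so a lookup sees the latest value per key at that point.
GlobalKTable<String, String> small = builder.globalTable(
    Serdes.String(), Serdes.String(), "topic_small", "small-store");

KStream<String, String> large =
    builder.stream(Serdes.String(), Serdes.String(), "topic_large");

// Non-windowed lookup join: each record from topic_large is joined against
// whatever the GlobalKTable currently holds for the mapped key.
KStream<String, String> enriched = large.join(
    small,
    (key, vt) -> key,           // KeyValueMapper: which key to look up
    (vt, v) -> vt + "," + v);   // ValueJoiner

enriched.to(Serdes.String(), Serdes.String(), "enriched-output");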