I have two streams of data flowing into a Kafka cluster. I want to process this 
data with Kafka streams. The stream producers are started at some time t.

I start up the Kafka Streams job 5 minutes later and start reading from 
earliest from both topics (about 900 000 messages already on each topic). The 
job parses the data and joins the two streams. On the intermediate topics, I 
see that all the earlier 2x900000 events are flowing through until the join. 
However, only 250 000 are outputted from the join, instead of 900 000. 

After processing the burst, the code works perfect on the new incoming events.

Grace ms of the join is put on 50 millis but putting it on 5 minutes or 10 
minutes makes no difference. My other settings:

    auto.offset.reset = earliest
    commit.interval.ms = 1000
    batch.size = 204800
    max.poll.records = 500000

The join is on the following window:

JoinWindows.of(Duration.ofMillis(1000))
      .grace(Duration.ofMillis(50))

As far as I understand these joins should be done on event time. What that can 
cause these results to be discarded?

Reply via email to