Hi, we're considering flink for a couple of our projects. I'm doing a trial implementation for one of them. So far, I like a lot of things, however there are a couple of issues that I can't figure out how to resolve. Not sure if it's me misunderstanding the tool, or flink just doesn't have a capability to do it.
We want to do an event time join on two big kafka streams. Both of them might experience some issues on the other end and be delayed. Additionally, while both are big, one (let's call it stream A) is significantly larger than stream B. We also know, that the join window is around 5min. That is, given some key K in stream B, if there is a counterpart in stream A, it's going to be +/5 5min in event time. Since stream A is especially heavy and it's unfeasable to keep hours of it in memory, I would imagine an ideal solution where we read both streams from Kafka. We always make sure that stream B is ahead by 10min, that is, if stream A is currently ahead in watermarks, we stall it and consume stream B until it catches up. Once the stream are alligned in event time (with the 10min delay window) we run them both through join. The problem is, that I find a mechanism to implement that in flink. If I try to do a CoProcessFunction then it just consumes both streams at the same time, ingests a lot of messages from stream A, runs out of memory and dies. Any ideas on how this could be solved? (here's a thread with a very similar problem from some time ago http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/synchronizing-two-streams-td6830.html) Regards, Gytis