Hi,

we're considering flink for a couple of our projects. I'm doing a
trial implementation for one of them. So far, I like a lot of things,
however there are a couple of issues that I can't figure out how to
resolve. Not sure if it's me misunderstanding the tool, or flink just
doesn't have a capability to do it.

We want to do an event time join on two big kafka streams. Both of
them might experience some issues on the other end and be delayed.
Additionally, while both are big, one (let's call it stream A) is
significantly larger than stream B.

We also know, that the join window is around 5min. That is, given some
key K in stream B, if there is a counterpart in stream A, it's going
to be +/5 5min in event time.

Since stream A is especially heavy and it's unfeasable to keep hours
of it in memory, I would imagine an ideal solution where we read both
streams from Kafka. We always make sure that stream B is ahead by
10min, that is, if stream A is currently ahead in watermarks, we stall
it and consume stream B until it catches up. Once the stream are
alligned in event time (with the 10min delay window) we run them both
through join.

The problem is, that I find a mechanism to implement that in flink. If
I try to do a CoProcessFunction then it just consumes both streams at
the same time, ingests a lot of messages from stream A, runs out of
memory and dies.

Any ideas on how this could be solved?

(here's a thread with a very similar problem from some time ago
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/synchronizing-two-streams-td6830.html)

Regards,
Gytis

Reply via email to