Hi Gytis, Flink does currently not support holding back individual streams, for example it is not possible to align streams on (offset) event-time.
However, the Flink community is working on a windowed join for the DataStream API, that only holds the relevant tail of the stream as state. If your join condition is +/- 5 minutes then, the join would store he last five minutes of both streams as state. Here's an implementation of the operator [1] that is close to be merged and will be available in Flink 1.6.0. Flink's SQL support (and Table API) support this join type since version 1.4.0 [2]. Best, Fabian [1] https://github.com/apache/flink/pull/5342 [2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/table/sql.html#joins 2018-03-08 1:02 GMT-08:00 Gytis Žilinskas <gytis.zilins...@gmail.com>: > Hi, > > we're considering flink for a couple of our projects. I'm doing a > trial implementation for one of them. So far, I like a lot of things, > however there are a couple of issues that I can't figure out how to > resolve. Not sure if it's me misunderstanding the tool, or flink just > doesn't have a capability to do it. > > We want to do an event time join on two big kafka streams. Both of > them might experience some issues on the other end and be delayed. > Additionally, while both are big, one (let's call it stream A) is > significantly larger than stream B. > > We also know, that the join window is around 5min. That is, given some > key K in stream B, if there is a counterpart in stream A, it's going > to be +/5 5min in event time. > > Since stream A is especially heavy and it's unfeasable to keep hours > of it in memory, I would imagine an ideal solution where we read both > streams from Kafka. We always make sure that stream B is ahead by > 10min, that is, if stream A is currently ahead in watermarks, we stall > it and consume stream B until it catches up. Once the stream are > alligned in event time (with the 10min delay window) we run them both > through join. > > The problem is, that I find a mechanism to implement that in flink. If > I try to do a CoProcessFunction then it just consumes both streams at > the same time, ingests a lot of messages from stream A, runs out of > memory and dies. > > Any ideas on how this could be solved? > > (here's a thread with a very similar problem from some time ago > http://apache-flink-user-mailing-list-archive.2336050. > n4.nabble.com/synchronizing-two-streams-td6830.html) > > Regards, > Gytis >