Hi Gytis,

Flink does not currently support holding back individual streams; for
example, it is not possible to align streams on (offset) event time.

However, the Flink community is working on a windowed join for the
DataStream API that holds only the relevant tail of each stream as state.
If your join condition is +/- 5 minutes, the join would store the last
five minutes of both streams as state. Here's an implementation of the
operator [1] that is close to being merged and will be available in Flink
1.6.0.
Flink's SQL and Table API have supported this join type since version
1.4.0 [2].
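
To make the "relevant tail as state" idea concrete, here is a minimal
plain-Java sketch of the concept. It is not the operator from [1] and does
not use any Flink API; the class BoundedIntervalJoin and all of its methods
are hypothetical names made up for illustration. It assumes a single
combined watermark for both inputs and that no element arrives with a
timestamp below the last watermark.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Conceptual sketch only (hypothetical names); not Flink's join operator. */
class BoundedIntervalJoin {
    /** Join bound: elements match if their event times differ by at most 5 minutes. */
    static final long BOUND_MS = 5 * 60 * 1000L;

    /** A buffered element: event timestamp plus payload. */
    static final class Event {
        final long ts;
        final String value;
        Event(long ts, String value) { this.ts = ts; this.value = value; }
    }

    // Per-key buffers for each input; this is the only state the join keeps.
    private final Map<String, List<Event>> leftBuf = new HashMap<>();
    private final Map<String, List<Event>> rightBuf = new HashMap<>();

    /** Element from the left input; returns the "left|right" pairs it produces. */
    List<String> onLeft(String key, long ts, String value) {
        return process(key, ts, value, leftBuf, rightBuf, true);
    }

    /** Element from the right input; returns the "left|right" pairs it produces. */
    List<String> onRight(String key, long ts, String value) {
        return process(key, ts, value, rightBuf, leftBuf, false);
    }

    private List<String> process(String key, long ts, String value,
                                 Map<String, List<Event>> own,
                                 Map<String, List<Event>> other,
                                 boolean isLeft) {
        // Buffer the new element so later elements from the other side can match it.
        own.computeIfAbsent(key, k -> new ArrayList<>()).add(new Event(ts, value));
        List<String> out = new ArrayList<>();
        // Probe the other side's buffer for elements within the time bound.
        for (Event e : other.getOrDefault(key, Collections.<Event>emptyList())) {
            if (Math.abs(e.ts - ts) <= BOUND_MS) {
                out.add(isLeft ? value + "|" + e.value : e.value + "|" + value);
            }
        }
        return out;
    }

    /**
     * Once event time has reached 'watermark', an element with
     * ts < watermark - BOUND_MS can no longer match anything, so drop it.
     */
    void onWatermark(long watermark) {
        evict(leftBuf, watermark);
        evict(rightBuf, watermark);
    }

    private static void evict(Map<String, List<Event>> buf, long watermark) {
        final long cutoff = watermark - BOUND_MS;
        for (List<Event> events : buf.values()) {
            events.removeIf(e -> e.ts < cutoff);
        }
        buf.values().removeIf(List::isEmpty);
    }

    /** Total number of buffered elements across both inputs. */
    int bufferedCount() {
        int n = 0;
        for (List<Event> l : leftBuf.values()) n += l.size();
        for (List<Event> l : rightBuf.values()) n += l.size();
        return n;
    }
}
```

The point of the sketch is the eviction rule in onWatermark: state stays
proportional to the join bound rather than to how far one stream runs
ahead of the other, which only helps if the watermark keeps advancing.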

Best, Fabian

[1] https://github.com/apache/flink/pull/5342
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/table/sql.html#joins

2018-03-08 1:02 GMT-08:00 Gytis Žilinskas <gytis.zilins...@gmail.com>:

> Hi,
>
> we're considering Flink for a couple of our projects. I'm doing a
> trial implementation for one of them. So far, I like a lot of things;
> however, there are a couple of issues that I can't figure out how to
> resolve. Not sure if it's me misunderstanding the tool, or Flink just
> doesn't have the capability to do it.
>
> We want to do an event time join on two big Kafka streams. Both of
> them might experience some issues on the other end and be delayed.
> Additionally, while both are big, one (let's call it stream A) is
> significantly larger than stream B.
>
> We also know that the join window is around 5min. That is, given some
> key K in stream B, if there is a counterpart in stream A, it's going
> to be +/- 5min in event time.
>
> Since stream A is especially heavy and it's infeasible to keep hours
> of it in memory, I would imagine an ideal solution where we read both
> streams from Kafka. We always make sure that stream B is ahead by
> 10min, that is, if stream A is currently ahead in watermarks, we stall
> it and consume stream B until it catches up. Once the streams are
> aligned in event time (with the 10min delay window) we run them both
> through the join.
>
> The problem is that I can't find a mechanism to implement that in Flink.
> If I try to do a CoProcessFunction then it just consumes both streams at
> the same time, ingests a lot of messages from stream A, runs out of
> memory and dies.
>
> Any ideas on how this could be solved?
>
> (here's a thread with a very similar problem from some time ago
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/synchronizing-two-streams-td6830.html)
>
> Regards,
> Gytis
>
