Re: Event time join

Vishal Santoshi Thu, 08 Mar 2018 09:05:13 -0800

Aah we have it here
https://docs.google.com/document/d/16GMH5VM6JJiWj_N0W8y3PtQ1aoJFxsKoOTSYOfqlsRE/edit#heading=h.bgl260hr56g6


On Thu, Mar 8, 2018 at 11:45 AM, Vishal Santoshi <vishal.santo...@gmail.com>
wrote:

> This is very interesting.  I would imagine that there will be high back
> pressure on the LEFT source effectively throttling it but as is the current
> state that is likely effect other pipelines as the free o/p buffer on the
> source side and and i/p buffers on the consumer side start blocking and get
> exhausted for all other pipes. I am very interested in how holding back the
> busy source does not create a pathological  issue where that source is
> forever held back. Is there a FLIP for it ?
>
> On Thu, Mar 8, 2018 at 11:29 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Gytis,
>>
>> Flink does currently not support holding back individual streams, for
>> example it is not possible to align streams on (offset) event-time.
>>
>> However, the Flink community is working on a windowed join for the
>> DataStream API, that only holds the relevant tail of the stream as state.
>> If your join condition is +/- 5 minutes then, the join would store he
>> last five minutes of both streams as state. Here's an implementation of the
>> operator [1] that is close to be merged and will be available in Flink
>> 1.6.0.
>> Flink's SQL support (and Table API) support this join type since version
>> 1.4.0 [2].
>>
>> Best, Fabian
>>
>> [1] https://github.com/apache/flink/pull/5342
>> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/
>> dev/table/sql.html#joins
>>
>> 2018-03-08 1:02 GMT-08:00 Gytis Žilinskas <gytis.zilins...@gmail.com>:
>>
>>> Hi,
>>>
>>> we're considering flink for a couple of our projects. I'm doing a
>>> trial implementation for one of them. So far, I like a lot of things,
>>> however there are a couple of issues that I can't figure out how to
>>> resolve. Not sure if it's me misunderstanding the tool, or flink just
>>> doesn't have a capability to do it.
>>>
>>> We want to do an event time join on two big kafka streams. Both of
>>> them might experience some issues on the other end and be delayed.
>>> Additionally, while both are big, one (let's call it stream A) is
>>> significantly larger than stream B.
>>>
>>> We also know, that the join window is around 5min. That is, given some
>>> key K in stream B, if there is a counterpart in stream A, it's going
>>> to be +/5 5min in event time.
>>>
>>> Since stream A is especially heavy and it's unfeasable to keep hours
>>> of it in memory, I would imagine an ideal solution where we read both
>>> streams from Kafka. We always make sure that stream B is ahead by
>>> 10min, that is, if stream A is currently ahead in watermarks, we stall
>>> it and consume stream B until it catches up. Once the stream are
>>> alligned in event time (with the 10min delay window) we run them both
>>> through join.
>>>
>>> The problem is, that I find a mechanism to implement that in flink. If
>>> I try to do a CoProcessFunction then it just consumes both streams at
>>> the same time, ingests a lot of messages from stream A, runs out of
>>> memory and dies.
>>>
>>> Any ideas on how this could be solved?
>>>
>>> (here's a thread with a very similar problem from some time ago
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nab
>>> ble.com/synchronizing-two-streams-td6830.html)
>>>
>>> Regards,
>>> Gytis
>>>
>>
>>
>

Re: Event time join

Reply via email to