Re: Event time join

Vishal Santoshi Thu, 08 Mar 2018 09:48:55 -0800

Yep. I think this leads to a more general question that may not be
pertinent to https://github.com/apache/flink/pull/5342: how do we throttle
a source if the held-back data gets unreasonably large? I know that is in
itself a broader question, but the delayed watermarks of a slow stream
accentuate the issue. I am curious to know how credit-based backpressure
handling plays into this, or is that outside the realm of this discussion?
And is credit-based backpressure handling in the 1.5 release?


On Thu, Mar 8, 2018 at 12:23 PM, Fabian Hueske <fhue...@gmail.com> wrote:

> The join would not cause backpressure but rather put all events that
> cannot be processed yet into state to process them later.
> So this works well if the data that is provided by the streams is roughly
> aligned by event time.
>
> 2018-03-08 9:04 GMT-08:00 Vishal Santoshi <vishal.santo...@gmail.com>:
>
>> Aah, we have it here:
>> https://docs.google.com/document/d/16GMH5VM6JJiWj_N0W8y3PtQ1aoJFxsKoOTSYOfqlsRE/edit#heading=h.bgl260hr56g6
>>
>> On Thu, Mar 8, 2018 at 11:45 AM, Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> This is very interesting. I would imagine that there will be high
>>> backpressure on the LEFT source, effectively throttling it, but in the
>>> current state that is likely to affect other pipelines, as the free
>>> output buffers on the source side and the input buffers on the consumer
>>> side start blocking and get exhausted for all other pipes. I am very
>>> interested in how holding back the busy source does not create a
>>> pathological issue where that source is forever held back. Is there a
>>> FLIP for it?
>>>
>>> On Thu, Mar 8, 2018 at 11:29 AM, Fabian Hueske <fhue...@gmail.com>
>>> wrote:
>>>
>>>> Hi Gytis,
>>>>
>>>> Flink currently does not support holding back individual streams; for
>>>> example, it is not possible to align streams on (offset) event time.
>>>>
>>>> However, the Flink community is working on a windowed join for the
>>>> DataStream API that only holds the relevant tail of each stream as state.
>>>> If your join condition is +/- 5 minutes, then the join would store the
>>>> last five minutes of both streams as state. Here's an implementation of
>>>> the operator [1] that is close to being merged and will be available in
>>>> Flink 1.6.0.
>>>> Flink's SQL (and Table API) has supported this join type since
>>>> version 1.4.0 [2].
>>>>
>>>> Best, Fabian
>>>>
>>>> [1] https://github.com/apache/flink/pull/5342
>>>> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/table/sql.html#joins
>>>>
>>>> 2018-03-08 1:02 GMT-08:00 Gytis Žilinskas <gytis.zilins...@gmail.com>:
>>>>
>>>>> Hi,
>>>>>
>>>>> We're considering Flink for a couple of our projects. I'm doing a
>>>>> trial implementation for one of them. So far, I like a lot of things;
>>>>> however, there are a couple of issues that I can't figure out how to
>>>>> resolve. Not sure if it's me misunderstanding the tool, or Flink just
>>>>> doesn't have the capability to do it.
>>>>>
>>>>> We want to do an event time join on two big Kafka streams. Both of
>>>>> them might experience some issues on the other end and be delayed.
>>>>> Additionally, while both are big, one (let's call it stream A) is
>>>>> significantly larger than stream B.
>>>>>
>>>>> We also know that the join window is around 5min. That is, given some
>>>>> key K in stream B, if there is a counterpart in stream A, it's going
>>>>> to be within +/- 5min in event time.
>>>>>
>>>>> Since stream A is especially heavy and it's infeasible to keep hours
>>>>> of it in memory, I would imagine an ideal solution where we read both
>>>>> streams from Kafka. We always make sure that stream B is ahead by
>>>>> 10min; that is, if stream A is currently ahead in watermarks, we stall
>>>>> it and consume stream B until it catches up. Once the streams are
>>>>> aligned in event time (with the 10min delay window) we run them both
>>>>> through the join.
>>>>>
>>>>> The problem is that I can't find a mechanism to implement that in
>>>>> Flink. If I try to do a CoProcessFunction, then it just consumes both
>>>>> streams at the same time, ingests a lot of messages from stream A, runs
>>>>> out of memory and dies.
>>>>>
>>>>> Any ideas on how this could be solved?
>>>>>
>>>>> (here's a thread with a very similar problem from some time ago
>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/synchronizing-two-streams-td6830.html)
>>>>>
>>>>> Regards,
>>>>> Gytis
>>>>>
>>>>
>>>>
>>>
>>
>
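
For readers of the archive, here is a toy sketch of the join behaviour
discussed in this thread, written in plain Python rather than against the
Flink API. It is not the operator from PR 5342. The class IntervalJoin and
its methods on_event, on_watermark and state_size are hypothetical names
made up for illustration, and the sketch assumes exact watermarks (no late
events). Each side buffers events per key, and a buffered event is dropped
once the other side's watermark has passed its timestamp plus the join
bound.

```python
BOUND_MS = 5 * 60 * 1000  # the +/- 5 minute join condition from the thread


class IntervalJoin:
    """Toy event-time join that buffers only the tail of each stream."""

    def __init__(self, bound_ms=BOUND_MS):
        self.bound_ms = bound_ms
        # One buffer per input side: key -> list of (timestamp, value).
        self.buffers = {"A": {}, "B": {}}
        # Highest watermark seen so far on each input.
        self.watermarks = {"A": float("-inf"), "B": float("-inf")}

    @staticmethod
    def _other(side):
        return "B" if side == "A" else "A"

    def on_event(self, side, key, ts, value):
        """Buffer the event and return the (a_value, b_value) pairs it joins."""
        matches = []
        for other_ts, other_value in self.buffers[self._other(side)].get(key, ()):
            if abs(ts - other_ts) <= self.bound_ms:
                if side == "A":
                    matches.append((value, other_value))
                else:
                    matches.append((other_value, value))
        self.buffers[side].setdefault(key, []).append((ts, value))
        return matches

    def on_watermark(self, side, watermark):
        """Advance one input's watermark and prune the opposite buffer.

        A watermark W on `side` promises no further events with ts <= W on
        that side, so an opposite-side event at time t can still match only
        if t + bound > W.
        """
        self.watermarks[side] = max(self.watermarks[side], watermark)
        wm = self.watermarks[side]
        buffer = self.buffers[self._other(side)]
        for key in list(buffer):
            kept = [(ts, v) for ts, v in buffer[key] if ts + self.bound_ms > wm]
            if kept:
                buffer[key] = kept
            else:
                del buffer[key]

    def state_size(self):
        """Total number of events currently held in state on both sides."""
        return sum(len(events)
                   for side in self.buffers.values()
                   for events in side.values())
```

In this sketch, if stream A runs hours ahead of stream B, nothing in A's
buffer can be pruned until B's watermark catches up, which is the
growing-state case raised above; state stays near the last five minutes
only when both streams advance roughly together in event time.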
