Yep. I think this leads to this general question and may be not pertinent to https://github.com/apache/flink/pull/5342. How do we throttle a source if the held back data gets unreasonably large ? I know that that is in itself a broader question but delayed watermarks of slow stream accentuates the issue . I am curious to know how credit based back pressure handling plays or is that outside the realm of this discussion ? And is credit based back pressure handling in 1.5 release.
On Thu, Mar 8, 2018 at 12:23 PM, Fabian Hueske <fhue...@gmail.com> wrote: > The join would not cause backpressure but rather put all events that > cannot be processed yet into state to process them later. > So this works well if the data that is provided by the streams is roughly > aligned by event time. > > 2018-03-08 9:04 GMT-08:00 Vishal Santoshi <vishal.santo...@gmail.com>: > >> Aah we have it here https://docs.google.com/d >> ocument/d/16GMH5VM6JJiWj_N0W8y3PtQ1aoJFxsKoOTSYOfqlsRE/edit# >> heading=h.bgl260hr56g6 >> >> On Thu, Mar 8, 2018 at 11:45 AM, Vishal Santoshi < >> vishal.santo...@gmail.com> wrote: >> >>> This is very interesting. I would imagine that there will be high back >>> pressure on the LEFT source effectively throttling it but as is the current >>> state that is likely effect other pipelines as the free o/p buffer on the >>> source side and and i/p buffers on the consumer side start blocking and get >>> exhausted for all other pipes. I am very interested in how holding back the >>> busy source does not create a pathological issue where that source is >>> forever held back. Is there a FLIP for it ? >>> >>> On Thu, Mar 8, 2018 at 11:29 AM, Fabian Hueske <fhue...@gmail.com> >>> wrote: >>> >>>> Hi Gytis, >>>> >>>> Flink does currently not support holding back individual streams, for >>>> example it is not possible to align streams on (offset) event-time. >>>> >>>> However, the Flink community is working on a windowed join for the >>>> DataStream API, that only holds the relevant tail of the stream as state. >>>> If your join condition is +/- 5 minutes then, the join would store he >>>> last five minutes of both streams as state. Here's an implementation of the >>>> operator [1] that is close to be merged and will be available in Flink >>>> 1.6.0. >>>> Flink's SQL support (and Table API) support this join type since >>>> version 1.4.0 [2]. >>>> >>>> Best, Fabian >>>> >>>> [1] https://github.com/apache/flink/pull/5342 >>>> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/ >>>> dev/table/sql.html#joins >>>> >>>> 2018-03-08 1:02 GMT-08:00 Gytis Žilinskas <gytis.zilins...@gmail.com>: >>>> >>>>> Hi, >>>>> >>>>> we're considering flink for a couple of our projects. I'm doing a >>>>> trial implementation for one of them. So far, I like a lot of things, >>>>> however there are a couple of issues that I can't figure out how to >>>>> resolve. Not sure if it's me misunderstanding the tool, or flink just >>>>> doesn't have a capability to do it. >>>>> >>>>> We want to do an event time join on two big kafka streams. Both of >>>>> them might experience some issues on the other end and be delayed. >>>>> Additionally, while both are big, one (let's call it stream A) is >>>>> significantly larger than stream B. >>>>> >>>>> We also know, that the join window is around 5min. That is, given some >>>>> key K in stream B, if there is a counterpart in stream A, it's going >>>>> to be +/5 5min in event time. >>>>> >>>>> Since stream A is especially heavy and it's unfeasable to keep hours >>>>> of it in memory, I would imagine an ideal solution where we read both >>>>> streams from Kafka. We always make sure that stream B is ahead by >>>>> 10min, that is, if stream A is currently ahead in watermarks, we stall >>>>> it and consume stream B until it catches up. Once the stream are >>>>> alligned in event time (with the 10min delay window) we run them both >>>>> through join. >>>>> >>>>> The problem is, that I find a mechanism to implement that in flink. If >>>>> I try to do a CoProcessFunction then it just consumes both streams at >>>>> the same time, ingests a lot of messages from stream A, runs out of >>>>> memory and dies. >>>>> >>>>> Any ideas on how this could be solved? >>>>> >>>>> (here's a thread with a very similar problem from some time ago >>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nab >>>>> ble.com/synchronizing-two-streams-td6830.html) >>>>> >>>>> Regards, >>>>> Gytis >>>>> >>>> >>>> >>> >> >