Re: streaming join implementation

2016-04-15 Thread Aljoscha Krettek
I'll try and answer both questions. Regarding Henry's question about very large state and caching: this depends on the StateBackend. The FsStateBackend has to keep all state on the JVM heap in hash-maps. If you have the appropriate number of machines which large memory then this could still work.

Re: streaming join implementation

2016-04-14 Thread Andrew Coates
Extending on what Henry is asking... What if data can be more that a day late, or in a more streaming nature, what if updates can come through for previous values? This would obviously involve storing a great deal of state. The use case I'm thinking of has large large volumes per day. So an extern

Re: streaming join implementation

2016-04-14 Thread Henry Cai
Cogroup is nice, thanks. But if I define a tumbling window of one day, does that mean flink needs to cache all the data for one day in memory? I have about 5TB of data coming for one day. About 50% records will find a matching records (the other 50% doesn't). On Thu, Apr 14, 2016 at 9:05 AM, A

Re: streaming join implementation

2016-04-14 Thread Aljoscha Krettek
Hi, right now, Flink does not give you a way to get the the records that where not joined for a join. You can, however use a co-group operation instead of a join to figure out which records did not join with records from the other side and treat them separately. Let me show an example: val input1

Re: streaming join implementation

2016-04-13 Thread Balaji Rajagopalan
Let me give you specific example, say stream1 event1 happened within your window 0-5 min with key1, and event2 on stream2 with key2 which could have matched with key1 happened at 5:01 outside the join window, so now you will have to co-relate the event2 on stream2 with the event1 with stream1 which

Re: streaming join implementation

2016-04-13 Thread Henry Cai
Thanks Balaji. Do you mean you spill the non-matching records after 5 minutes into redis? Does flink give you control on which records is not matching in the current window such that you can copy into a long-term storage? On Wed, Apr 13, 2016 at 11:20 PM, Balaji Rajagopalan < balaji.rajagopa..

Re: streaming join implementation

2016-04-13 Thread Balaji Rajagopalan
You can implement join in flink (which is a inner join) the below mentioned pseudo code . The below join is for a 5 minute interval, yes will be some corners cases when the data coming after 5 minutes will be missed out in the join window, I actually had solved this problem but storing some data i

streaming join implementation

2016-04-13 Thread Henry Cai
Hi, We are evaluating different streaming platforms. For a typical join between two streams select a.*, b.* FROM a, b ON a.id == b.id How does flink implement the join? The matching record from either stream can come late, we consider it's a valid join as long as the event time for record a an