I'll try and answer both questions.
Regarding Henry's question about very large state and caching: this depends
on the StateBackend. The FsStateBackend has to keep all state on the JVM
heap in hash-maps. If you have enough machines with
large memory, then this could still work.
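
As a rough back-of-the-envelope check for the heap-based case, here is a small Python sketch. The 5 TB per day figure is the one mentioned elsewhere in this thread. The 64 GB heap per machine, the 2x object overhead factor, and the 50% usable heap fraction are purely illustrative assumptions, not measured values or Flink defaults, and the function name is hypothetical.

```python
import math

TB = 1024 ** 4
GB = 1024 ** 3

def machines_needed(state_bytes, heap_bytes_per_machine,
                    overhead_factor=2.0, usable_fraction=0.5):
    """Rough number of machines needed to keep all state on the heap.

    overhead_factor: assumed blow-up of raw data once stored as objects.
    usable_fraction: assumed share of each heap that can hold state.
    """
    needed = state_bytes * overhead_factor
    usable = heap_bytes_per_machine * usable_fraction
    return math.ceil(needed / usable)

# Worst case: the whole day (5 TB) has to be held at once.
print(machines_needed(5 * TB, 64 * GB))
```

If only the records still waiting for a match have to be kept, the first argument shrinks accordingly, but the calculation stays the same.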
Extending on what Henry is asking... What if data can be more than a day
late, or, in a more streaming nature, what if updates can come through for
previous values?
This would obviously involve storing a great deal of state. The use case
I'm thinking of has large volumes per day. So an extern
Cogroup is nice, thanks.
But if I define a tumbling window of one day, does that mean Flink needs to
cache all the data for one day in memory? I have about 5TB of data coming
in per day. About 50% of the records will find a matching record (the other
50% won't).
On Thu, Apr 14, 2016 at 9:05 AM, A
Hi,
right now, Flink does not give you a way to get the records that were
not joined for a join. You can, however, use a co-group operation instead of
a join to figure out which records did not join with records from the other
side and treat them separately.
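
To make the idea concrete independent of any Flink API, here is a minimal plain-Python sketch of what a co-group gives you: for each key you see the records from both sides together, so an empty side tells you that nothing joined. The function names (`co_group`, `join_with_leftovers`) and the sample records are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def co_group(left, right, key):
    """Group both inputs by key and yield (key, left_items, right_items)."""
    groups = defaultdict(lambda: ([], []))
    for item in left:
        groups[key(item)][0].append(item)
    for item in right:
        groups[key(item)][1].append(item)
    for k, (ls, rs) in groups.items():
        yield k, ls, rs

def join_with_leftovers(left, right, key):
    """Return (joined pairs, records that found no partner)."""
    joined, unmatched = [], []
    for _, ls, rs in co_group(left, right, key):
        if ls and rs:
            # Both sides present: emit the inner-join pairs for this key.
            joined.extend((l, r) for l in ls for r in rs)
        else:
            # One side is empty: these records did not join.
            unmatched.extend(ls + rs)
    return joined, unmatched

a = [{"id": 1, "v": "a1"}, {"id": 2, "v": "a2"}]
b = [{"id": 1, "v": "b1"}, {"id": 3, "v": "b3"}]
joined, unmatched = join_with_leftovers(a, b, key=lambda r: r["id"])
print(joined)
print(unmatched)
```

The unmatched list is what you would treat separately, for example by writing it to external storage.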
Let me show an example:
val input1
Let me give you a specific example: say event1 on stream1 happened within
your window 0-5 min with key1, and event2 on stream2 with key2, which could
have matched key1, happened at 5:01, outside the join window. So now you will
have to correlate event2 on stream2 with event1 on stream1, which
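
The boundary effect in this example can be reproduced with a small plain-Python sketch of a tumbling-window inner join. It is not Flink code. The event layout `(timestamp_seconds, key, value)`, the function names, and the use of a single shared key `"k"` in place of the matching key1/key2 pair are assumptions made for illustration.

```python
WINDOW_SECONDS = 5 * 60  # 5 minute tumbling window

def window_start(ts_seconds, size=WINDOW_SECONDS):
    """Start of the tumbling window that contains the timestamp."""
    return ts_seconds - (ts_seconds % size)

def windowed_inner_join(stream1, stream2, size=WINDOW_SECONDS):
    """Join events that share both the key and the tumbling window."""
    buckets = {}
    for ts, key, value in stream1:
        buckets.setdefault((window_start(ts, size), key), []).append(value)
    out = []
    for ts, key, value in stream2:
        for left in buckets.get((window_start(ts, size), key), []):
            out.append((key, left, value))
    return out

stream1 = [(4 * 60 + 30, "k", "event1")]   # 4:30, inside window 0-5 min
on_time = [(4 * 60 + 50, "k", "event2")]   # 4:50, same window
late = [(5 * 60 + 1, "k", "event2")]       # 5:01, next window

print(windowed_inner_join(stream1, on_time))  # one joined pair
print(windowed_inner_join(stream1, late))     # empty: the match is missed
```

The second call returns nothing even though the keys match, which is why the unmatched side has to be kept somewhere beyond the window if late matches matter.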
Thanks Balaji. Do you mean you spill the non-matching records after 5
minutes into Redis? Does Flink give you control over which records are not
matching in the current window, such that you can copy them into long-term
storage?
On Wed, Apr 13, 2016 at 11:20 PM, Balaji Rajagopalan <
balaji.rajagopa..
You can implement a join in Flink (which is an inner join) with the pseudo
code mentioned below. The join below is for a 5 minute interval; yes, there
will be some corner cases where data arriving after 5 minutes is missed by
the join window. I actually solved this problem by storing some data
i
Hi,
We are evaluating different streaming platforms. For a typical join
between two streams
SELECT a.*, b.*
FROM a JOIN b
ON a.id = b.id
How does Flink implement the join? The matching record from either stream
can come late; we consider it a valid join as long as the event time for
record a an