Sorry for the late response. With the amount of data you are planning to join, any system would take time. However, between Hive's MapReduce joins, Spark's basic shuffle joins, and Spark SQL's joins, the last wins hands down. Furthermore, with the APIs of Spark and Spark Streaming, you will have to do strictly less work to set up the infrastructure you want to build.
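To make that concrete, here is a minimal sketch of the Spark SQL side of such a join, assuming a recent Spark with the DataFrame API; the paths, column names, and the 30-minute tolerance are hypothetical, taken from the question below rather than from any real pipeline:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.abs

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JoinSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical inputs; schema (time: Long, hashkey: String, foo1/foo2: String).
    val stream1 = spark.read.parquet("hdfs:///events/stream1")
    val stream2 = spark.read.parquet("hdfs:///events/stream2")

    // Equi-join on hashkey; Spark SQL plans a shuffle join across the cluster
    // and prunes pairs more than ~30 minutes apart with an event-time predicate.
    val joined = stream1.as("a")
      .join(stream2.as("b"), $"a.hashkey" === $"b.hashkey")
      .where(abs($"a.time" - $"b.time") <= 30 * 60)
      .select($"a.time", $"a.hashkey", $"a.foo1", $"b.foo2")

    joined.write.parquet("hdfs:///events/joined")
    spark.stop()
  }
}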
Yes, Spark Streaming currently does not support providing your own timer, because the logic to handle delays etc. is pretty complex and specific to each application. Usually that logic can be implemented on top of the windowing solution that Spark Streaming already provides; a sketch follows below the quoted message.

TD

On Thu, Feb 5, 2015 at 7:37 AM, Zilvinas Saltys <zilvinas.sal...@gmail.com> wrote:

> The challenge I have is this. There are two streams of data where an event
> might look like this in stream1: (time, hashkey, foo1) and in stream2:
> (time, hashkey, foo2).
> The result after joining should be (time, hashkey, foo1, foo2). The join
> happens on hashkey, and the time difference between events can be ~30 minutes.
> The amount of data is enormous: hundreds of billions of events per month.
> I need to not only join the existing historical data but continue to do so
> with incoming data (which arrives in batches rather than as a true stream).
>
> For now I was thinking of implementing this in MapReduce with sliding windows.
> I'm wondering if Spark can actually help me with this sort of challenge.
> How would a join of two huge streams of historical data actually happen
> internally within Spark, and would it be more efficient than, say, a Hive
> MapReduce stream join of two big tables?
>
> I also saw that Spark Streaming has windowing support, but it seems you
> cannot provide your own timer? As in, I cannot make the time be derived from
> the events themselves rather than having an actual clock running.
>
> Thanks,
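Here is the promised sketch of implementing that event-time logic on top of Spark Streaming's windowing, as described above. It is a minimal illustration, not production code: the socket sources, hostnames, ports, and the "time,hashkey,payload" line format are all hypothetical, and the 60-minute window sliding every 30 minutes is just one way to ensure events up to ~30 minutes apart land in at least one common window:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object EventTimeJoinSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EventTimeJoinSketch")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Hypothetical sources emitting "time,hashkey,payload" lines; any
    // receiver would do. Key by hashkey and keep event time in the value.
    def parse(line: String): (String, (Long, String)) = {
      val Array(t, k, v) = line.split(",")
      (k, (t.toLong, v))
    }
    val s1 = ssc.socketTextStream("host1", 9001).map(parse)
    val s2 = ssc.socketTextStream("host2", 9002).map(parse)

    // Window both streams over 60 minutes, sliding every 30, then join on
    // hashkey and enforce the ~30-minute bound using the event time carried
    // in the records themselves rather than the wall clock.
    val joined = s1.window(Minutes(60), Minutes(30))
      .join(s2.window(Minutes(60), Minutes(30)))
      .filter { case (_, ((t1, _), (t2, _))) => math.abs(t1 - t2) <= 30 * 60 }

    joined.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

One caveat: because consecutive windows overlap, a matching pair can surface in two windows, so downstream consumers would need to deduplicate; that is exactly the kind of application-specific delay logic that has to live in your code rather than in Spark Streaming itself.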