Hi, I have been looking into using Spark Streaming for a specific use case: 
joining events from multiple time-series streams. 

The part that I am having a hard time understanding is the consistency 
semantics of this across multiple streams. As per [1] Section 4.3.4, I 
understand that Spark has the notion of RDDs (i.e., micro-batch intervals) 
across multiple streams, and that these batch boundaries are synchronized. 

But these batch times probably have no relation to the actual event times 
within each batch. So if I have two streams, each with two minutes' worth of 
data, I do not yet see how they could be ingested in a synchronized manner 
such that Spark can maintain alignment of their event-time boundaries. 

Or, put another way: as a producer of these streams (for example, into 
Kinesis), I have no notion of batch times. Given that, if I had multiple 
streams, I do not see how Spark could synchronize them on event time. 

What am I missing?

Thanks,
Ashwin

[1] http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf


