I find monoids pretty useful in this respect: you separate the logic out into a monoid and then apply that same logic to either a stream or a batch. A list of such practices would be really useful.
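For example, here is a rough sketch of what I mean, assuming simple (key, count) pairs. The Monoid trait and the reduceWithMonoid helper below are just illustrative names, not Spark APIs:

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on Spark 1.2 and earlier)
    import org.apache.spark.rdd.RDD

    // Minimal monoid abstraction: an identity plus an associative combine.
    trait Monoid[T] extends Serializable {
      def zero: T
      def plus(a: T, b: T): T
    }

    object CountMonoid extends Monoid[Long] {
      val zero = 0L
      def plus(a: Long, b: Long): Long = a + b
    }

    // The shared logic is written once against a plain RDD of (key, value) pairs.
    def reduceWithMonoid[K: ClassTag, V: ClassTag](data: RDD[(K, V)], m: Monoid[V]): RDD[(K, V)] =
      data.reduceByKey(m.plus)

    // Batch job: apply it directly to the batch RDD.
    //   val batchResult = reduceWithMonoid(batchRdd, CountMonoid)

    // Streaming job: reuse the exact same function on every micro-batch of the DStream.
    //   val streamResult = kvStream.transform(rdd => reduceWithMonoid(rdd, CountMonoid))

Because the combine function is associative, the same m.plus can also be handed to reduceByKeyAndWindow or updateStateByKey when the aggregation has to span micro-batches, so the reduce logic still lives in one place.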
On Thu, Feb 19, 2015 at 12:26 AM, Jean-Pascal Billaud <j...@tellapart.com> wrote:

> Hey,
>
> It seems pretty clear that one of the strengths of Spark is being able to
> share your code between your batch and streaming layers. However, given that
> Spark Streaming uses a DStream, which is a set of RDDs, while batch Spark
> uses a single RDD, there might be some complexity associated with that.
>
> Of course, since a DStream is a superset of RDDs, one can just run the same
> code at the RDD granularity using DStream::foreachRDD. While this should
> work for map, I am not sure how that can work when it comes to the reduce
> phase, given that a group of keys spans multiple RDDs.
>
> One option is to change the dataset object that a job works on. For example,
> instead of passing an RDD to a class method, one passes a higher-level
> object (MetaRDD) that wraps around an RDD or a DStream depending on the
> context. At that point the job calls its regular maps, reduces and so on,
> and the MetaRDD wrapper delegates accordingly.
>
> I would just like to know the official best practice from the Spark
> community though.
>
> Thanks,

--
*Arush Kharbanda* || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
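For reference, a rough sketch of the MetaRDD wrapper idea described above, assuming Spark's Scala API; the MetaPairRDD, BatchPairRDD and StreamPairRDD names are hypothetical, not part of Spark:

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext._                // pair-RDD implicits (Spark 1.2 and earlier)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream implicits (Spark 1.2 and earlier)
    import org.apache.spark.streaming.dstream.DStream

    // Job code is written once against this trait; the batch and streaming
    // drivers choose the concrete implementation to wrap their dataset in.
    trait MetaPairRDD[K, V] extends Serializable {
      def mapValues[U: ClassTag](f: V => U): MetaPairRDD[K, U]
      def reduceByKey(f: (V, V) => V): MetaPairRDD[K, V]
    }

    // Batch implementation: delegates to the RDD API.
    class BatchPairRDD[K: ClassTag, V: ClassTag](val rdd: RDD[(K, V)])
        extends MetaPairRDD[K, V] {
      def mapValues[U: ClassTag](f: V => U): MetaPairRDD[K, U] =
        new BatchPairRDD(rdd.mapValues(f))
      def reduceByKey(f: (V, V) => V): MetaPairRDD[K, V] =
        new BatchPairRDD(rdd.reduceByKey(f))
    }

    // Streaming implementation: delegates to the DStream API, which applies
    // the same operations per micro-batch.
    class StreamPairRDD[K: ClassTag, V: ClassTag](val stream: DStream[(K, V)])
        extends MetaPairRDD[K, V] {
      def mapValues[U: ClassTag](f: V => U): MetaPairRDD[K, U] =
        new StreamPairRDD(stream.mapValues(f))
      def reduceByKey(f: (V, V) => V): MetaPairRDD[K, V] =
        new StreamPairRDD(stream.reduceByKey(f))
    }

    // Shared job logic, unaware of batch vs. streaming:
    //   def wordCount(data: MetaPairRDD[String, Long]): MetaPairRDD[String, Long] =
    //     data.reduceByKey(_ + _)

Note that in the streaming case reduceByKey still aggregates within each micro-batch only; a cross-batch reduce would need windowed or stateful operations (reduceByKeyAndWindow, updateStateByKey) behind the same interface.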