Hi Jon,

In Spark Streaming, 1 batch = 1 RDD; the terms are essentially used
interchangeably, which means foreachRDD already runs exactly once per
batch interval. If you are trying to collect multiple batches of a
DStream into a single RDD, look at the window() operations.
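For illustration, here's a rough (untested) sketch of window() in Java.
The socket source, the 2-second batch interval, and the window sizes are
all made-up values for the example; with window length equal to slide
interval, the windows don't overlap, so every element lands in exactly
one window:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class WindowSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]").setAppName("WindowSketch");
        // 2-second batch interval: each interval produces exactly one RDD
        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(2));

        // hypothetical source; substitute your real input DStream
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // window(length, slide) unions the RDDs of all batches falling in
        // each window into a single RDD; length == slide means the windows
        // don't overlap, so each element is processed only once
        JavaDStream<String> windowed =
                lines.window(Durations.seconds(10), Durations.seconds(10));

        windowed.foreachRDD((rdd, windowTime) -> {
            // rdd here is the union of the (up to) 5 batches in the window
            System.out.println("window " + windowTime + ": "
                    + rdd.count() + " elements");
        });

        jssc.start();
        jssc.awaitTermination();
    }
}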
Hope this helps
Nikunj

On Wed, Jul 15, 2015 at 7:00 PM, Jon Chase <jon.ch...@gmail.com> wrote:

> I should note that the amount of data in each batch is very small, so I'm
> not concerned with performance implications of grouping into a single RDD.
>
> On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase <jon.ch...@gmail.com> wrote:
>
>> I'm currently doing something like this in my Spark Streaming program
>> (Java):
>>
>> dStream.foreachRDD((rdd, batchTime) -> {
>>     log.info("processing RDD from batch {}", batchTime);
>>     ....
>>     // my rdd processing code
>>     ....
>> });
>>
>> Instead of having my rdd processing code called once for each RDD in the
>> batch, is it possible to essentially group all of the RDDs from the batch
>> into a single RDD and single partition and therefore operate on all of the
>> elements in the batch at once?
>>
>> My goal here is to do an operation exactly once for every batch. As I
>> understand it, foreachRDD is going to do the operation once for each RDD in
>> the batch, which is not what I want.
>>
>> I've looked at DStream.repartition(int), but the docs make it sound like
>> it only changes the number of partitions in the batch's existing RDDs, not
>> the number of RDDs.
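P.S. Since a batch is a single RDD, your foreachRDD body already runs
exactly once per batch. If the part you're really after is the single
partition (one pass that sees every element), then given how little data
you have per batch, a coalesce(1) inside foreachRDD is probably the
simplest route. Again just an untested sketch against your quoted
dStream:

dStream.foreachRDD((rdd, batchTime) -> {
    rdd.coalesce(1).foreachPartition(iterator -> {
        // with a single partition, this closure sees the whole batch
        int count = 0;
        while (iterator.hasNext()) {
            iterator.next();  // your per-element processing goes here
            count++;
        }
        System.out.println("batch " + batchTime + ": " + count + " elements");
    });
});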