Hi Jon,

In Spark Streaming, 1 batch = 1 RDD; essentially, the terms are used
interchangeably. That means foreachRDD is already invoked exactly once per
batch interval, so your processing code runs once per batch, not once per
RDD within it. If you are trying to collect multiple batches of a DStream
into a single RDD, look at the window() operations (see the sketch below).
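
For example, here's a minimal sketch, assuming dStream is a
JavaDStream<String> on a 10-second batch interval (both durations below are
illustrative placeholders). window() groups the last three batches into one
windowed RDD, and coalesce(1) collapses it to a single partition so one task
sees every element at once:

    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // window(windowDuration, slideDuration): with both set to 30s on a
    // 10s batch interval, each windowed RDD covers 3 batches and is
    // emitted once every 30s, with no overlap between windows.
    JavaDStream<String> windowed =
        dStream.window(Durations.seconds(30), Durations.seconds(30));

    windowed.foreachRDD((rdd, batchTime) -> {
        // coalesce(1) collapses the windowed RDD to a single partition,
        // so one task processes the whole window's elements together.
        rdd.coalesce(1).foreachPartition(elements ->
            elements.forEachRemaining(e -> System.out.println(e)));
    });

If you just want your code to run once per original batch, your existing
foreachRDD already does that, since each batch produces exactly one RDD.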

Hope this helps,
Nikunj


On Wed, Jul 15, 2015 at 7:00 PM, Jon Chase <jon.ch...@gmail.com> wrote:

> I should note that the amount of data in each batch is very small, so I'm
> not concerned with performance implications of grouping into a single RDD.
>
> On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase <jon.ch...@gmail.com> wrote:
>
>> I'm currently doing something like this in my Spark Streaming program
>> (Java):
>>
>>         dStream.foreachRDD((rdd, batchTime) -> {
>>             log.info("processing RDD from batch {}", batchTime);
>>             ....
>>             // my rdd processing code
>>             ....
>>         });
>>
>> Instead of having my rdd processing code called once for each RDD in the
>> batch, is it possible to essentially group all of the RDDs from the batch
>> into a single RDD and single partition and therefore operate on all of the
>> elements in the batch at once?
>>
>> My goal here is to do an operation exactly once for every batch.  As I
>> understand it, foreachRDD is going to do the operation once for each RDD in
>> the batch, which is not what I want.
>>
>> I've looked at DStream.repartition(int), but the docs make it sound like
>> it only changes the number of partitions in the batch's existing RDDs, not
>> the number of RDDs.
>>
