Looks like this method should serve Jon's needs:
def reduceByWindow(
reduceFunc: (T, T) => T,
windowDuration: Duration,
slideDuration: Duration
On Wed, Jul 15, 2015 at 8:23 PM, N B <[email protected]> wrote:
> Hi Jon,
>
> In Spark streaming, 1 batch = 1 RDD. Essentially, the terms are used
> interchangeably. If you are trying to collect multiple batches across a
> DStream into a single RDD, look at the window() operations.
>
> Hope this helps
> Nikunj
>
>
> On Wed, Jul 15, 2015 at 7:00 PM, Jon Chase <[email protected]> wrote:
>
>> I should note that the amount of data in each batch is very small, so I'm
>> not concerned with performance implications of grouping into a single RDD.
>>
>> On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase <[email protected]> wrote:
>>
>>> I'm currently doing something like this in my Spark Streaming program
>>> (Java):
>>>
>>> dStream.foreachRDD((rdd, batchTime) -> {
>>> log.info("processing RDD from batch {}", batchTime);
>>> ....
>>> // my rdd processing code
>>> ....
>>> });
>>>
>>> Instead of having my rdd processing code called once for each RDD in the
>>> batch, is it possible to essentially group all of the RDDs from the batch
>>> into a single RDD and single partition and therefore operate on all of the
>>> elements in the batch at once?
>>>
>>> My goal here is to do an operation exactly once for every batch. As I
>>> understand it, foreachRDD is going to do the operation once for each RDD in
>>> the batch, which is not what I want.
>>>
>>> I've looked at DStream.repartition(int), but the docs make it sound like
>>> it only changes the number of partitions in the batch's existing RDDs, not
>>> the number of RDDs.
>>>
>>
>>
>