Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

Mark Hamstra Wed, 11 Feb 2015 15:01:39 -0800

No, only each group should need to fit.

On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet <cjno...@gmail.com> wrote:


> Doesn't iter still need to fit entirely into memory?
>
> On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> rdd.mapPartitions { iter =>
>>   val grouped = iter.grouped(batchSize)
>>   for (group <- grouped) { ... }
>> }
>>
>> On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>
>>> I think the word "partition" here is a tad different than the term
>>> "partition" that we use in Spark. Basically, I want something similar to
>>> Guava's Iterables.partition [1], that is, If I have an RDD[People] and I
>>> want to run an algorithm that can be optimized by working on 30 people at a
>>> time, I'd like to be able to say:
>>>
>>> val rdd: RDD[People] = .....
>>> val partitioned: RDD[Seq[People]] = rdd.partition(30)....
>>>
>>> I also don't want any shuffling- everything can still be processed
>>> locally.
>>>
>>>
>>> [1]
>>> http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterables.html#partition(java.lang.Iterable,%20int)
>>>
>>
>>
>

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

Reply via email to