Check spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala

It can be used through sliding(windowSize: Int) in
spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala
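As a plain-Scala sketch (no Spark needed for the illustration): `Iterator.sliding` shows the per-partition behavior that `sliding` provides — note it produces *overlapping* windows, unlike Guava's non-overlapping groups, though with a step equal to the window size the windows stop overlapping. The values below are purely illustrative.

```scala
// Overlapping windows, analogous to what RDDFunctions.sliding produces:
val windows = (1 to 5).iterator.sliding(3).map(_.toList).toList
assert(windows == List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5)))

// With an explicit step equal to the window size, the windows no longer
// overlap, which matches Guava's Iterables.partition semantics
// (the trailing partial group is kept):
val groups = (1 to 7).iterator.sliding(3, 3).map(_.toList).toList
assert(groups == List(List(1, 2, 3), List(4, 5, 6), List(7)))
```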

Yuhao

From: Mark Hamstra <m...@clearstorydata.com>
Sent: Thursday, February 12, 2015 7:00 AM
To: Corey Nolet
Cc: user
Subject: Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

No, only each group should need to fit.

On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet <cjno...@gmail.com> wrote:
Doesn't iter still need to fit entirely into memory?

On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
rdd.mapPartitions { iter =>
  // grouped is a lazy Iterator[Seq[T]]; yield keeps the result an
  // Iterator, which is what mapPartitions must return
  val grouped = iter.grouped(batchSize)
  for (group <- grouped) yield { ... }
}
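A small plain-Scala sketch of why this stays memory-friendly: `iter.grouped(n)` is itself a lazy `Iterator[Seq[T]]`, so inside `mapPartitions` only one group of `batchSize` elements is materialized at a time. The iterator and values below are illustrative stand-ins for a partition's contents.

```scala
// Stand-in for the Iterator[T] that mapPartitions hands to each partition.
val iter = (1 to 10).iterator

// Lazy Iterator[Seq[Int]]: nothing is materialized yet.
val batches = iter.grouped(3)

// Process three elements at a time; the trailing partial group is included.
val sums = batches.map(_.sum).toList
assert(sums == List(6, 15, 24, 10))

// In Spark, the equivalent one-liner would be (sketch, not compiled here):
//   val partitioned: RDD[Seq[People]] = rdd.mapPartitions(_.grouped(30))
// which involves no shuffle, since each group stays within its partition.
```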

On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet <cjno...@gmail.com> wrote:
I think the word "partition" here is a bit different from the term "partition" that we use in Spark. Basically, I want something similar to Guava's Iterables.partition [1]: if I have an RDD[People] and I want to run an algorithm that can be optimized by working on 30 people at a time, I'd like to be able to say:

val rdd: RDD[People] = .....
val partitioned: RDD[Seq[People]] = rdd.partition(30)....

I also don't want any shuffling; everything can still be processed locally.


[1] 
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterables.html#partition(java.lang.Iterable,%20int)


