I need some advice on how data should be stored in an RDD.  I have millions
of records, called "Measures".  They are bucketed by keys of type String.
Should I store them as RDD[(String, Measure)] or as RDD[(String,
Iterable[Measure])], and why?

Data in each bucket are not related most of the time.  The operations that I
often need to do are:

- Sort the Measures in each bucket separately
- Aggregate the Measures in each bucket separately
- Combine Measures in two RDDs into one based on their bucket keys
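
To make the three operations concrete, here is a rough sketch of how they
might look on the flat RDD[(String, Measure)] layout.  The fields of Measure
(timestamp, value), the object name BucketOps and the choice of a sum as the
aggregate are all assumptions made up for illustration; only the RDD calls
(groupByKey, mapValues, reduceByKey, cogroup) are standard Spark API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3, the pair-RDD methods also need:
//   import org.apache.spark.SparkContext._

// Hypothetical record type: the real fields of Measure are not given in this post.
case class Measure(timestamp: Long, value: Double)

object BucketOps {

  // Sort the Measures in each bucket separately.
  // groupByKey yields RDD[(String, Iterable[Measure])]; each bucket is then
  // sorted locally by the (assumed) timestamp field.
  def sortPerBucket(rdd: RDD[(String, Measure)]): RDD[(String, Seq[Measure])] =
    rdd.groupByKey().mapValues(_.toSeq.sortBy(_.timestamp))

  // Aggregate the Measures in each bucket separately.
  // A sum of the (assumed) value field stands in for the real aggregate.
  def sumPerBucket(rdd: RDD[(String, Measure)]): RDD[(String, Double)] =
    rdd.mapValues(_.value).reduceByKey(_ + _)

  // Combine the Measures of two RDDs based on their bucket keys.
  // cogroup keeps buckets that appear in only one of the two inputs.
  def combine(
      a: RDD[(String, Measure)],
      b: RDD[(String, Measure)]
  ): RDD[(String, (Iterable[Measure], Iterable[Measure]))] =
    a.cogroup(b)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("bucket-ops-sketch")
    val sc = new SparkContext(conf)
    try {
      // Tiny made-up data set with two buckets.
      val a = sc.parallelize(Seq(
        ("bucketA", Measure(2L, 1.0)),
        ("bucketA", Measure(1L, 2.0)),
        ("bucketB", Measure(5L, 3.0))
      ))
      val b = sc.parallelize(Seq(("bucketA", Measure(3L, 4.0))))

      sortPerBucket(a).collect().foreach(println)
      sumPerBucket(a).collect().foreach(println)
      combine(a, b).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

In this sketch, groupByKey and cogroup bring all Measures of one bucket
together on a single executor, so the sort and the combine assume that each
bucket fits in memory; reduceByKey merges values on the map side first and
does not need the whole bucket at once.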






