I need some advice on how data should be stored in an RDD. I have millions of records, called "Measures", which are bucketed by keys of String type. Should I store them as RDD[(String, Measure)] or as RDD[(String, Iterable[Measure])], and why?
Data in each bucket are not related most of the time. The operations that I often need to do are:
- Sort the Measures in each bucket separately
- Aggregate the Measures in each bucket separately
- Combine the Measures in two RDDs into one based on their bucket keys
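To make the three operations concrete, here is a minimal sketch of what they might look like on the flat RDD[(String, Measure)] layout. It is only an illustration under assumptions: the Measure case class and its timestamp and value fields are hypothetical, sorting by timestamp and summing values are stand-ins for the real sort order and aggregate, and the object and method names are made up for this example. The Spark calls themselves (groupByKey, mapValues, reduceByKey, union) are standard pair-RDD operations.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record type: the real fields of Measure are not given in this post.
case class Measure(timestamp: Long, value: Double)

object BucketOps {
  // Note: Spark versions before 1.3 also need `import org.apache.spark.SparkContext._`
  // to bring the pair-RDD operations into scope.

  // Sort the Measures in each bucket separately.
  // Ordering by timestamp is an assumption made for illustration.
  def sortPerBucket(rdd: RDD[(String, Measure)]): RDD[(String, Seq[Measure])] =
    rdd.groupByKey().mapValues(_.toSeq.sortBy(_.timestamp))

  // Aggregate the Measures in each bucket separately.
  // Summing the value field is an assumed aggregate.
  def sumPerBucket(rdd: RDD[(String, Measure)]): RDD[(String, Double)] =
    rdd.mapValues(_.value).reduceByKey(_ + _)

  // Combine the Measures of two RDDs into one; records keep their bucket keys,
  // so later per-key operations see the Measures from both inputs.
  def combine(a: RDD[(String, Measure)],
              b: RDD[(String, Measure)]): RDD[(String, Measure)] =
    a.union(b)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("bucket-ops").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val a = sc.parallelize(Seq("k1" -> Measure(2L, 1.0), "k1" -> Measure(1L, 2.0)))
      val b = sc.parallelize(Seq("k1" -> Measure(3L, 4.0), "k2" -> Measure(1L, 8.0)))
      val all = combine(a, b)
      sortPerBucket(all).collect().foreach(println)
      sumPerBucket(all).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With the grouped layout, RDD[(String, Iterable[Measure])], the sort would become a plain mapValues over each bucket, while the aggregate and the combine would have to work on whole buckets instead of single records.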