Juan, I wouldn't go so far as to suggest we switch from programming with RDDs to programming with SparkIterable. For example, all methods involving the context, jobs, or partitions should remain part of the RDD API only and not become part of SparkIterable. That said, the Spark community would benefit from a consistent set of APIs for both RDDs and the Iterables inside RDDs.
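To make the shape of the idea concrete, here is a rough sketch of what the shared trait might look like. To be clear, SparkIterable is only a proposal at this point, so every name and signature below is made up for illustration, not an existing API:

// Hypothetical sketch only: SparkIterable does not exist; these names are
// invented to illustrate the "shared data-parallel subset" idea.
trait SparkIterable[A] {
  // Operations that make sense both on an RDD and on an Iterable inside one.
  def map[B](f: A => B): SparkIterable[B]
  def filter(p: A => Boolean): SparkIterable[A]
  def flatMap[B](f: A => TraversableOnce[B]): SparkIterable[B]
}

// A toy in-memory implementation, just to show the abstraction compiles
// and composes; RDD would be the distributed implementation.
class LocalIterable[A](underlying: Iterable[A]) extends SparkIterable[A] {
  def map[B](f: A => B): SparkIterable[B] =
    new LocalIterable(underlying.map(f))
  def filter(p: A => Boolean): SparkIterable[A] =
    new LocalIterable(underlying.filter(p))
  def flatMap[B](f: A => TraversableOnce[B]): SparkIterable[B] =
    new LocalIterable(underlying.flatMap(f))
}

The point is simply that only the data-parallel subset would live in the trait; anything touching SparkContext, jobs, or partitions (persist, checkpoint, getNumPartitions, etc.) would stay on RDD alone.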
You raise an important point about performance analysis and guarantees. Reasoning about performance should be no more complicated than reasoning about the performance of code that works with the Iterables generated by mapPartitions or groupByKey today.

However, it is important not to confuse users about which object they are working with: an RDD that supports the SparkIterable API, vs. an Iterable inside an RDD that also supports the SparkIterable API (e.g., one that mapPartitions generates). Therefore, RDD transformation APIs should continue to return RDDs, as they do today.

Thank you for your implementation pointers. The Scala type system is certainly flexible enough to support SparkIterable. If we get more consensus that this is a good direction, I'd love to do a Skype session with you to evaluate implementation options.

Best,
Sim
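P.S. In case a concrete example of what users already reason about today helps, the following is standard Spark code (only the variable names are mine):

import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "sparkiterable-demo")
val rdd = sc.parallelize(Seq("a" -> 1, "a" -> 2, "b" -> 3))

// mapPartitions hands user code an Iterator over one partition's elements;
// reasoning about its cost is per-partition, exactly as it would be for a
// SparkIterable over the same data.
val counted = rdd.mapPartitions { iter => Iterator(iter.size) }

// groupByKey exposes a local Iterable per key; user code already reasons
// about these local collections today.
val grouped = rdd.groupByKey().mapValues(_.sum)

// Both transformations return RDDs, which is the property worth preserving:
// the caller always knows whether it holds an RDD or a local Iterable.
println(counted.collect().toSeq)  // partition sizes, e.g. Seq(2, 1)
println(grouped.collect().toMap)  // Map(a -> 3, b -> 3)

sc.stop()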