Juan, I wouldn't go so far as to suggest we switch from programming with RDDs to programming with SparkIterable. For example, all methods involving the context, jobs, or partitions should remain part of the RDD API only and not become part of SparkIterable. That said, the Spark community would benefit from a consistent set of APIs for both RDDs and the Iterables inside RDDs.
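To make the shape of the idea concrete, here is a rough sketch of what the shared trait might look like. To be clear, SparkIterable is only a proposal at this point, so every name and signature below is made up for illustration, not an existing API:

// Hypothetical sketch only: SparkIterable does not exist; these names are
// invented to illustrate the "shared data-parallel subset" idea.
trait SparkIterable[A] {
  // Operations that make sense both on an RDD and on an Iterable inside one.
  def map[B](f: A => B): SparkIterable[B]
  def filter(p: A => Boolean): SparkIterable[A]
  def flatMap[B](f: A => TraversableOnce[B]): SparkIterable[B]
}

// A toy in-memory implementation, just to show the abstraction compiles
// and composes; RDD would be the distributed implementation.
class LocalIterable[A](underlying: Iterable[A]) extends SparkIterable[A] {
  def map[B](f: A => B): SparkIterable[B] =
    new LocalIterable(underlying.map(f))
  def filter(p: A => Boolean): SparkIterable[A] =
    new LocalIterable(underlying.filter(p))
  def flatMap[B](f: A => TraversableOnce[B]): SparkIterable[B] =
    new LocalIterable(underlying.flatMap(f))
}

The point is simply that only the data-parallel subset would live in the trait; anything touching SparkContext, jobs, or partitions (persist, checkpoint, getNumPartitions, etc.) would stay on RDD alone.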
You raise an important point about performance analysis and guarantees. Reasoning about performance should be no more complicated than reasoning about the performance of code that works with the Iterables generated by mapPartitions or groupByKey today.

However, it is important not to confuse users about which object they are working with: an RDD that supports the SparkIterable API, vs. an Iterable inside an RDD that also supports the SparkIterable API (e.g., one that mapPartitions generates). Therefore, RDD transformation APIs should continue to return RDDs, as they do today.

Thank you for your implementation pointers. The Scala type system is certainly flexible enough to support SparkIterable. If we get more consensus that this is a good direction, I'd love to do a Skype session with you to evaluate implementation options.

Best,
Sim
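P.S. In case a concrete example of what users already reason about today helps, the following is standard Spark code (only the variable names are mine):

import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "sparkiterable-demo")
val rdd = sc.parallelize(Seq("a" -> 1, "a" -> 2, "b" -> 3))

// mapPartitions hands user code an Iterator over one partition's elements;
// reasoning about its cost is per-partition, exactly as it would be for a
// SparkIterable over the same data.
val counted = rdd.mapPartitions { iter => Iterator(iter.size) }

// groupByKey exposes a local Iterable per key; user code already reasons
// about these local collections today.
val grouped = rdd.groupByKey().mapValues(_.sum)

// Both transformations return RDDs, which is the property worth preserving:
// the caller always knows whether it holds an RDD or a local Iterable.
println(counted.collect().toSeq)  // partition sizes, e.g. Seq(2, 1)
println(grouped.collect().toMap)  // Map(a -> 3, b -> 3)

sc.stop()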