I'd like to get some feedback on an API design issue pertaining to RDDs. The design goal of avoiding RDD nesting, which I agree with, leads methods that operate on subsets of an RDD (not necessarily partitions) to expose those subsets as Iterables. The mapPartitions and groupBy* families of methods are good examples. The problem with that API choice is that developers very quickly lose the benefits of the RDD API, independent of partitioning.
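For example (a rough sketch, assuming a SparkContext sc as in the shell; the RDD and variable names are made up):

    import org.apache.spark.rdd.RDD

    val numbers: RDD[Int] = sc.parallelize(1 to 100000)

    // groupBy returns RDD[(K, Iterable[Int])]: each group is a plain
    // Scala Iterable, so sample(), the partitioner-aware operations,
    // etc. from the RDD API are no longer available inside a group.
    val grouped: RDD[(Boolean, Iterable[Int])] = numbers.groupBy(_ % 2 == 0)

    // Likewise, mapPartitions hands the function an Iterator, not an RDD.
    val shifted: RDD[Int] = numbers.mapPartitions(iter => iter.map(_ + 1))

The moment the data sits behind an Iterable or Iterator, everything has to be hand-rolled with plain Scala collections.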
Consider two very simple problems that demonstrate the issue. The input is the same for both: an RDD of integers that has been grouped into odds and evens.

1. Sample the odds at 10% and the evens at 20%. Trivial, since stratified sampling (sampleByKey) is built into PairRDDFunctions.

2. Sample at 10% if a group has more than 1,000 elements and at 20% otherwise. Suddenly the problem becomes a lot harder. The sub-groups are no longer RDDs, so we can't use the RDD sampling API.

Note that the only reason the first problem is easy is that stratified sampling happens to be part of Spark. If it weren't, implementing it with the higher-level API abstractions wouldn't be easy either. As more and more people use Spark for ever more diverse sets of problems, the likelihood that the RDD APIs already provide the high-level abstraction a problem needs will diminish.

How do you feel about this? Do you think it is desirable to lose all the high-level RDD API abstractions the moment we group an RDD or call mapPartitions? Does the goal of no nested RDDs mean there are absolutely no high-level abstractions we can expose via the Iterables borne of RDDs?

I'd love your thoughts.

/Sim
http://linkedin.com/in/simeons
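P.S. To make the contrast concrete, here is a rough sketch (reusing the numbers RDD from above; the hand-rolled fallback in the second case is just one naive possibility, not a recommendation):

    import scala.util.Random

    // Problem 1: stratified sampling is built in (PairRDDFunctions.sampleByKey),
    // so the data never has to leave the RDD world.
    val keyed = numbers.keyBy(_ % 2 == 0)
    val fractions = Map(true -> 0.2, false -> 0.1) // evens at 20%, odds at 10%
    val stratified = keyed.sampleByKey(withReplacement = false, fractions)

    // Problem 2: the rate depends on group size, so after groupBy each group
    // is a plain Iterable and the sampling has to be rolled by hand.
    val sizeDependent = numbers.groupBy(_ % 2 == 0).flatMapValues { group =>
      val fraction = if (group.size > 1000) 0.1 else 0.2
      group.filter(_ => Random.nextDouble() < fraction) // naive Bernoulli sample
    }

The first case is one library call; the second throws away the RDD sampling machinery entirely because the group is no longer an RDD.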