Hi Sim,

I understand that what you propose is defining a trait SparkIterable (and also a PairSparkIterable for RDDs of pairs) that encapsulates the methods of RDDs, and then programming against that trait instead of against RDD. That is similar to programming against scala.collection.GenSeq to abstract over whether a Seq is sequential or parallel. The new trait SparkIterable would be needed to cover methods of RDDs that are not present in GenSeq and the other standard traits. I understand you suggest implementing it with wrapper classes and implicit conversions, as in PairRDDFunctions, so that RDD, Iterable and other classes can all be seen as SparkIterable. That reminds me of type classes (http://danielwestheide.com/blog/2013/02/06/the-neophytes-guide-to-scala-part-12-type-classes.html), which could be a similar approach. I think it would be interesting to know whether some standard type classes, for example those in https://non.github.io/cats//typeclasses.html, could be of use here.
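As a quick sketch of what the type-class encoding could look like (everything here is hypothetical, not anything in Spark's API: the name SparkIterable, the two methods I chose, and the instance are all just for illustration):

```scala
// Hypothetical type class abstracting over RDD-like containers.
trait SparkIterable[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
  def filter[A](fa: F[A])(p: A => Boolean): F[A]
}

object SparkIterable {
  // Instance for plain Scala collections (the node-local case).
  implicit val iterableInstance: SparkIterable[Iterable] =
    new SparkIterable[Iterable] {
      def map[A, B](fa: Iterable[A])(f: A => B): Iterable[B] = fa.map(f)
      def filter[A](fa: Iterable[A])(p: A => Boolean): Iterable[A] = fa.filter(p)
    }

  // An RDD instance would look analogous (it would additionally need
  // ClassTag evidence and a SparkContext, so it is only sketched here):
  // implicit val rddInstance: SparkIterable[RDD] = new SparkIterable[RDD] {
  //   def map[A, B: ClassTag](fa: RDD[A])(f: A => B): RDD[B] = fa.map(f)
  //   ...
  // }
}

// Generic code written against the abstraction instead of against RDD:
def doubleEvens[F[_]](xs: F[Int])(implicit F: SparkIterable[F]): F[Int] =
  F.map(F.filter(xs)(_ % 2 == 0))(_ * 2)
```

Code like doubleEvens would then run unchanged over an RDD or over a local collection, e.g. `doubleEvens((1 to 4): Iterable[Int])`, which is exactly the kind of abstraction you describe.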
A downside I see in this approach is that it would be more difficult to reason about the performance of programs, and to write them to obtain the best performance, if we don't know whether a SparkIterable is a distributed RDD or a node-local collection, which might, for example, even be indexed. Or we might avoid accessing a SparkIterable from a closure in a map because we don't know whether we are in the driver or in a worker. That could hinder the development of efficient programs, but this is not very surprising, because the trade-off between abstraction level and performance is always there in programming anyway.

Anyway, I find your idea very interesting; I think it could be developed into a nice library.

Greetings,
Juan

2015-09-18 14:55 GMT+02:00 sim <s...@swoop.com>:
> @debasish83, yes, there are many ways to optimize and work around the
> limitation of no nested RDDs. The point of this thread is to discuss the
> API patterns of Spark in order to make the platform more accessible to
> lots of developers solving interesting problems quickly. We can get API
> consistency without resorting to simulations of nested RDDs.
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-API-patterns-tp14116p14195.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org