Hi Sim,

I understand that what you propose is defining a trait SparkIterable (and also a PairSparkIterable for RDDs of pairs) that encapsulates the methods of RDDs, and then programming against that trait instead of against RDD. That is similar to programming against scala.collection.GenSeq to abstract over whether a Seq is sequential or parallel. The new trait SparkIterable would be needed to cover methods of RDDs that are not present in GenSeq and the other standard traits. I understand you suggest implementing it with wrapper classes and implicit conversions, as in PairRDDFunctions, so that RDD, Iterable and other classes can all be seen as SparkIterable. That reminds me of type classes (http://danielwestheide.com/blog/2013/02/06/the-neophytes-guide-to-scala-part-12-type-classes.html), which could be a similar approach. I think it would be interesting to know whether some standard type classes, for example those in https://non.github.io/cats//typeclasses.html, could be of use here.
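As a quick sketch of what the type-class encoding could look like (everything here is hypothetical, not anything in Spark's API: the name SparkIterable, the two methods I chose, and the instance are all just for illustration):

```scala
// Hypothetical type class abstracting over RDD-like containers.
trait SparkIterable[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
  def filter[A](fa: F[A])(p: A => Boolean): F[A]
}

object SparkIterable {
  // Instance for plain Scala collections (the node-local case).
  implicit val iterableInstance: SparkIterable[Iterable] =
    new SparkIterable[Iterable] {
      def map[A, B](fa: Iterable[A])(f: A => B): Iterable[B] = fa.map(f)
      def filter[A](fa: Iterable[A])(p: A => Boolean): Iterable[A] = fa.filter(p)
    }

  // An RDD instance would look analogous (it would additionally need
  // ClassTag evidence and a SparkContext, so it is only sketched here):
  // implicit val rddInstance: SparkIterable[RDD] = new SparkIterable[RDD] {
  //   def map[A, B: ClassTag](fa: RDD[A])(f: A => B): RDD[B] = fa.map(f)
  //   ...
  // }
}

// Generic code written against the abstraction instead of against RDD:
def doubleEvens[F[_]](xs: F[Int])(implicit F: SparkIterable[F]): F[Int] =
  F.map(F.filter(xs)(_ % 2 == 0))(_ * 2)
```

Code like doubleEvens would then run unchanged over an RDD or over a local collection, e.g. `doubleEvens((1 to 4): Iterable[Int])`, which is exactly the kind of abstraction you describe.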
A downside I see in this approach is that it would be more difficult to reason about the performance of programs, and to write them to obtain the best performance, if we don't know whether a SparkIterable is a distributed RDD or a node-local collection, which might, for example, even be indexed. Or we might avoid accessing a SparkIterable from a closure in a map because we don't know whether we are in the driver or in a worker. That could hinder the development of efficient programs, but this is not very surprising, because the trade-off between abstraction level and performance is always there in programming anyway.

Anyway, I find your idea very interesting; I think it could be developed into a nice library.

Greetings,
Juan

2015-09-18 14:55 GMT+02:00 sim <s...@swoop.com>:
> @debasish83, yes, there are many ways to optimize and work around the
> limitation of no nested RDDs. The point of this thread is to discuss the
> API patterns of Spark in order to make the platform more accessible to
> lots of developers solving interesting problems quickly. We can get API
> consistency without resorting to simulations of nested RDDs.
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-API-patterns-tp14116p14195.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org