I'm not sure what we can do here. Nested RDDs are a pain to implement,
support, and explain. The programming model is not well explored.

Maybe a UDAF interface that allows going through the data twice?
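
To make "going through the data twice" concrete, here is a rough sketch in plain Python rather than Spark. It is not a proposed API. The function names, the threshold and the fractions are all made up for illustration. The first pass counts elements per group, the second pass samples each element with a fraction chosen from its group's count.

```python
import random
from collections import Counter


def fractions_by_group_size(counts, threshold=1000,
                            large_fraction=0.1, small_fraction=0.2):
    """Pick a sampling fraction per key from that key's element count."""
    return {
        key: large_fraction if n > threshold else small_fraction
        for key, n in counts.items()
    }


def sample_by_group_size(pairs, threshold=1000,
                         large_fraction=0.1, small_fraction=0.2, seed=42):
    """Two-pass sampling over (key, value) pairs (hypothetical helper)."""
    pairs = list(pairs)  # materialize so the data can be traversed twice

    # Pass 1: count the elements in each group.
    counts = Counter(key for key, _ in pairs)
    fractions = fractions_by_group_size(
        counts, threshold, large_fraction, small_fraction)

    # Pass 2: keep each element with its group's probability
    # (Bernoulli sampling without replacement).
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]


# Example input: 1,250 odd numbers and 500 even numbers.
data = [("odd", n) for n in range(1, 2500, 2)]
data += [("even", n) for n in range(2, 1001, 2)]
sampled = sample_by_group_size(data)
```

On an actual pair RDD the same shape could presumably be written as countByKey followed by sampleByKey with the resulting fractions map, at the cost of bringing the per-key counts back to the driver in between.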


On Mon, Sep 14, 2015 at 4:36 PM, sim <s...@swoop.com> wrote:

> I'd like to get some feedback on an API design issue pertaining to RDDs.
>
> The design goal to avoid RDD nesting, which I agree with, leads the methods
> operating on subsets of an RDD (not necessarily partitions) to use Iterable
> as an abstraction. The mapPartitions and groupBy* family of methods are good
> examples. The problem with that API choice is that developers often very
> quickly run out of the benefits of the RDD API, independent of partitioning.
>
> Consider two very simple problems that demonstrate the issue. The input is
> the same for both: an RDD of integers that has been grouped into odd and even.
>
> 1. Sample the odds at 10% and the evens at 20%. Trivial, as stratified
> sampling (sampleByKey) is built into PairRDDFunctions.
>
> 2. Sample at 10% if there are more than 1,000 elements in a group and at 20%
> otherwise. Suddenly, the problem becomes much harder. The sub-groups are
> no longer RDDs and we can't use the RDD sampling API.
>
> Note that the only reason the first problem is easy is that stratified
> sampling is already part of Spark. If it weren't, implementing it with the
> higher-level API abstractions wouldn't have been easy. As more and more
> people use Spark for ever more diverse sets of problems, the likelihood that
> the RDD APIs provide pre-existing high-level abstractions will diminish.
>
> How do you feel about this? Do you think it is desirable to lose all
> high-level RDD API abstractions the very moment we group an RDD or call
> mapPartitions? Does the goal of no nested RDDs mean there are absolutely no
> high-level abstractions that we can expose via the Iterables borne of RDDs?
>
> I'd love your thoughts.
>
> /Sim
> http://linkedin.com/in/simeons
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-API-patterns-tp14116.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
