I agree that this is an issue, but I am afraid supporting RDD nesting would be
hard and would perhaps require rearchitecting Spark. For now, you may have to
use workarounds such as storing each group in a separate file, processing each
file as a separate RDD, and finally merging the results into a single RDD.
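
To make the workaround concrete, here is a rough sketch for your second
problem (10% if a group has more than 1,000 elements, 20% otherwise). It is
only an illustration under assumptions: instead of writing each group to a
file, it keeps everything in memory and carves out one RDD per key with
filter. The object name PerGroupSampling, the method sampleBySize, and the
String keys are made up for the example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object PerGroupSampling {
  // Hypothetical helper, not part of Spark: treats every group as its own RDD
  // so that the ordinary RDD.sample API can be applied to it, then merges.
  def sampleBySize(sc: SparkContext,
                   pairs: RDD[(String, Int)],
                   seed: Long = 42L): RDD[(String, Int)] = {
    // One pass over the data to learn the size of each group (driver side).
    val counts: scala.collection.Map[String, Long] = pairs.countByKey()

    // One RDD per key. Each filter rescans the input, so this only makes
    // sense when the number of distinct keys is small.
    val sampled: Seq[RDD[(String, Int)]] = counts.toSeq.map { case (key, n) =>
      val fraction = if (n > 1000L) 0.1 else 0.2
      pairs
        .filter { case (k, _) => k == key }
        .sample(withReplacement = false, fraction = fraction, seed = seed)
    }

    // Merge the per-group results back into a single RDD.
    if (sampled.isEmpty) sc.emptyRDD[(String, Int)] else sc.union(sampled)
  }
}
```

If sampling is all you need, the same counts could instead be turned into a
fractions map and passed to sampleByKey, which avoids the per-key filter. The
one-RDD-per-group shape is the more general workaround, because any RDD
operation can be applied to each group before the merge.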

I know it's painful and I share the pain :)

Thanks,
Aniket

On Tue, Sep 15, 2015, 5:06 AM sim [via Apache Spark Developers List] <
ml-node+s1001551n14116...@n3.nabble.com> wrote:

> I'd like to get some feedback on an API design issue pertaining to RDDs.
>
> The design goal to avoid RDD nesting, which I agree with, leads the
> methods operating on subsets of an RDD (not necessarily partitions) to use
> Iterable as an abstraction. The mapPartitions and groupBy* family of
> methods are good examples. The problem with that API choice is that
> developers often very quickly run out of the benefits of the RDD API,
> independent of partitioning.
>
> Consider two very simple problems that demonstrate the issue. The input is
> the same for all: an RDD of integers that has been grouped into odd and
> even.
>
> 1. Sample the odds at 10% and the evens at 20%. Trivial, as stratified
> sampling (sampleByKey) is built into PairRDDFunctions.
>
> 2. Sample at 10% if there are more than 1,000 elements in a group and at
> 20% otherwise. Suddenly, the problem becomes a lot less easy. The
> sub-groups are no longer RDDs and we can't use the RDD sampling API.
>
> Note that the only reason the first problem is easy is because it was part
> of Spark. If that hadn't happened, implementing it with the higher-level
> API abstractions wouldn't have been easy. As more and more people use Spark
> for ever more diverse sets of problems the likelihood that the RDD APIs
> provide pre-existing high-level abstractions will diminish.
>
> How do you feel about this? Do you think it is desirable to lose all
> high-level RDD API abstractions the very moment we group an RDD or call
> mapPartitions? Does the goal of no nested RDDs mean there are absolutely no
> high-level abstractions that we can expose via the Iterables borne of RDDs?
>
> I'd love your thoughts.
>
> /Sim
> http://linkedin.com/in/simeons
>




