Hi,

That reminds me of a previous discussion about splitting an RDD into several RDDs: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-split-into-multiple-RDDs-td11877.html. There you can see simple code to convert an RDD[(K, V)] into a Map[K, RDD[V]] through several filters. On top of that, maybe you could build an abstraction that simulates nested RDDs as a proof of concept, setting performance aside for now. But the main problem I've found is that the Spark scheduler gets stuck when you have a huge number of very small RDDs, or at least that is what happened several versions ago: http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccamassdj+bzv++cr44edv-cpchr-1x-a+y2vmtugwc0ux91f...@mail.gmail.com%3E
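To make the idea concrete, here is a minimal sketch of the filter-based split from that thread, with plain Scala collections standing in for RDDs so the snippet is self-contained (with real RDDs, each filter below would produce a separate RDD reading from the same parent):

```scala
// Sketch of the RDD[(K, V)] -> Map[K, RDD[V]] split via repeated filters.
// Plain Seq/Map stand in for RDDs here; the logic is the same.
def splitByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] = {
  // first pass: collect the distinct keys (collectAsMap-style on a real RDD)
  val keys = pairs.map(_._1).distinct
  keys.map { k =>
    // one filter per key; on a real RDD this scans the parent once per key,
    // which is exactly why a huge number of keys strains the scheduler
    k -> pairs.filter(_._1 == k).map(_._2)
  }.toMap
}

val split = splitByKey(Seq(("odd", 1), ("even", 2), ("odd", 3)))
// split("odd") == Seq(1, 3); split("even") == Seq(2)
```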
Just my two cents.

2015-09-16 11:51 GMT+02:00 Aniket <aniket.bhatna...@gmail.com>:

> I agree that this is an issue, but I am afraid supporting RDD nesting would
> be hard and would perhaps require rearchitecting Spark. For now, you may
> want to use workarounds like storing each group in a separate file,
> processing each file as a separate RDD, and finally merging the results
> into a single RDD.
>
> I know it's painful, and I share the pain :)
>
> Thanks,
> Aniket
>
> On Tue, Sep 15, 2015, 5:06 AM sim [via Apache Spark Developers List]
> <[hidden email]> wrote:
>
>> I'd like to get some feedback on an API design issue pertaining to RDDs.
>>
>> The design goal of avoiding RDD nesting, which I agree with, leads the
>> methods operating on subsets of an RDD (not necessarily partitions) to
>> use Iterable as an abstraction. The mapPartitions and groupBy* families
>> of methods are good examples. The problem with that API choice is that
>> developers often very quickly lose the benefits of the RDD API,
>> independent of partitioning.
>>
>> Consider two very simple problems that demonstrate the issue. The input
>> is the same for both: an RDD of integers that has been grouped into odd
>> and even.
>>
>> 1. Sample the odds at 10% and the evens at 20%. Trivial, as stratified
>> sampling (sampleByKey) is built into PairRDDFunctions.
>>
>> 2. Sample at 10% if there are more than 1,000 elements in a group and at
>> 20% otherwise. Suddenly, the problem becomes a lot less easy. The
>> sub-groups are no longer RDDs, and we can't use the RDD sampling API.
>>
>> Note that the only reason the first problem is easy is that it was
>> solved as part of Spark itself. If that hadn't happened, implementing it
>> with the higher-level API abstractions wouldn't have been easy. As more
>> and more people use Spark for ever more diverse sets of problems, the
>> likelihood that the RDD APIs provide pre-existing high-level abstractions
>> will diminish.
>>
>> How do you feel about this?
>> Do you think it is desirable to lose all high-level RDD API abstractions
>> the very moment we group an RDD or call mapPartitions? Does the goal of
>> no nested RDDs mean there are absolutely no high-level abstractions that
>> we can expose via the Iterables borne of RDDs?
>>
>> I'd love your thoughts.
>>
>> /Sim
>> http://linkedin.com/in/simeons
>
> View this message in context: Re: RDD API patterns
> <http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-API-patterns-tp14116p14146.html>
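For what it's worth, the second problem in the quoted message (a size-dependent sampling rate) can at least be expressed over the Iterables that groupBy hands back. A hedged sketch, again with plain Scala collections standing in for grouped RDD data, and with the threshold, rates, and the name sampleBySize made up for illustration:

```scala
import scala.util.Random

// Sketch of problem 2 from the quoted message: sample each group at 10% if
// it has more than 1,000 elements, at 20% otherwise. This runs over local
// collections, which is exactly the limitation being discussed -- inside a
// group we no longer have an RDD, so no distributed sampling API applies.
def sampleBySize[K, V](groups: Map[K, Seq[V]],
                       threshold: Int = 1000,
                       rng: Random = new Random(7)): Map[K, Seq[V]] =
  groups.map { case (k, values) =>
    // the per-group rate depends on the group's size
    val rate = if (values.size > threshold) 0.10 else 0.20
    k -> values.filter(_ => rng.nextDouble() < rate)
  }

val groups = Map(
  "big"   -> (1 to 5000).toSeq, // > 1,000 elements: sampled at ~10%
  "small" -> (1 to 100).toSeq   // <= 1,000 elements: sampled at ~20%
)
val sampled = sampleBySize(groups)
```

The catch, as the message points out, is that this only works if each group fits in one executor's memory; the sub-groups are ordinary Iterables, not RDDs.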