Hi, I have used union() before and yes, it can be slow. I _guess_ your
variable 'data' is a Scala collection and compute() returns an RDD. Right?
If so, I have used the approach below to operate on a single RDD for the
whole computation (yes, I also found that too many RDDs hurt performance).

Change compute() to return a Scala collection instead of an RDD.

    val result = sc.parallelize(data)  // Create and partition the 0.5M items in a single RDD.
      .flatMap(compute(_))             // You still have only one RDD, with each item already joined with the external data.
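
For example, compute() might look roughly like this after the change. Just a
sketch: I'm making up the element types, and I use a plain Map in place of
your shared external dataset (a broadcast variable would work the same way).

    // Stand-in for the shared external dataset; the types are illustrative only.
    val external: Map[Int, Seq[String]] = Map(1 -> Seq("a", "b"))

    // Return plain results for one item instead of wrapping them in an RDD,
    // so the only RDD in the job is the one created by sc.parallelize(data).
    def compute(item: Int): Seq[(Int, String)] =
      external.getOrElse(item, Nil).map(v => (item, v))

With that signature, flatMap(compute(_)) flattens the per-item results
directly into the single RDD.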

Hope this helps.

Kelvin

On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen <y...@yang-cs.com> wrote:

> Hi Mark,
>
> That's true, but neither one lets me combine the RDDs, so I have to avoid
> unions.
>
> Thanks,
> Yang
>
> On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> RDD#union is not the same thing as SparkContext#union
>>
>> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <y...@yang-cs.com> wrote:
>>
>>> Hi Noorul,
>>>
>>> Thank you for your suggestion. I tried that, but ran out of memory. I did
>>> some searching and found suggestions that we should avoid rdd.union (
>>> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
>>> ). I will try to come up with some other approach.
>>>
>>> Thank you,
>>> Yang
>>>
>>> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <noo...@noorul.com>
>>> wrote:
>>>
>>>> sparkx <y...@yang-cs.com> writes:
>>>>
>>>> > Hi,
>>>> >
>>>> > I have a Spark job and a dataset of 0.5 million items. For each item I
>>>> > perform some computation (joining against a shared external dataset, if
>>>> > that matters) and produce an RDD containing 20-500 result items. Now I
>>>> > would like to combine all these RDDs and run the next job. What I have
>>>> > found is that the computation itself is quite fast, but combining these
>>>> > RDDs takes much longer.
>>>> >
>>>> >     val result = data        // 0.5M data items
>>>> >       .map(compute(_))   // Produces an RDD - fast
>>>> >       .reduce(_ ++ _)      // Combining RDDs - slow
>>>> >
>>>> > I have also tried to collect results from compute(_) and use a
>>>> > flatMap, but that is also slow.
>>>> >
>>>> > Is there a way to do this efficiently? I'm thinking about writing the
>>>> > result to HDFS and reading it back from disk for the next job, but I'm
>>>> > not sure whether that is a preferred way in Spark.
>>>> >
>>>>
>>>> Are you looking for SparkContext.union() [1]?
>>>>
>>>> It did not perform well with the Spark Cassandra connector, though, so I
>>>> am not sure whether it will help you.
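>>>>
>>>> For instance, a rough sketch (perItemRDDs is a made-up name for a Seq of
>>>> your per-item RDDs):
>>>>
>>>>     // One union over the whole sequence instead of chaining RDD#union
>>>>     val combined = sc.union(perItemRDDs)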
>>>>
>>>> Thanks and Regards
>>>> Noorul
>>>>
>>>> [1]
>>>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>>>
>>>
>>>
>>>
>>> --
>>> Yang Chen
>>> Dept. of CISE, University of Florida
>>> Mail: y...@yang-cs.com
>>> Web: www.cise.ufl.edu/~yang
>>>
>>
>>
>
>
> --
> Yang Chen
> Dept. of CISE, University of Florida
> Mail: y...@yang-cs.com
> Web: www.cise.ufl.edu/~yang
>
