Hi,

I've used union() before, and yes, it can be slow at times. I _guess_ your variable 'data' is a Scala collection and compute() returns an RDD. Right? If so, try the approach below, which operates on a single RDD for the whole computation (yes, I have also seen that having too many RDDs hurts performance). The idea is to change compute() to return a plain Scala collection instead of an RDD, and let flatMap do the flattening inside one RDD.
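To make it concrete, here is a minimal, self-contained sketch of what I mean. The compute() and the external map below are just stand-ins for your own code, and I'm assuming your shared external dataset is small enough to broadcast:

import org.apache.spark.{SparkConf, SparkContext}

object SingleRddJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("single-rdd").setMaster("local[*]"))

    // Stand-in for your shared external dataset, broadcast once so every
    // task can join against it locally instead of building its own RDD.
    val external = sc.broadcast((0 until 10).map(i => i -> s"ext-$i").toMap)

    // Stand-in for your compute(): joins one item with the external data
    // and returns a *local* Scala collection, not an RDD. (One result per
    // item here; yours would return 20-500.)
    def compute(item: Int): Seq[String] =
      external.value.get(item % 10).map(v => s"$item joined with $v").toSeq

    val data = 1 to 500000              // your 0.5M input items

    val result = sc.parallelize(data)   // one RDD holding all the items
      .flatMap(compute)                 // still one RDD; nothing to union

    println(result.count())             // trigger the job
    sc.stop()
  }
}

Because compute() now returns a plain collection, flatMap flattens each item's results inside that one RDD, and the slow combine step disappears entirely.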
Applied to your pipeline, that becomes:

val result = sc.parallelize(data) // Create and partition the 0.5M items in a single RDD.
  .flatMap(compute(_))            // Still only one RDD, with each item already joined with the external data.

Hope this helps,
Kelvin

On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen <y...@yang-cs.com> wrote:

> Hi Mark,
>
> That's true, but neither one lets me combine the RDDs, so I have to
> avoid unions.
>
> Thanks,
> Yang
>
> On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> RDD#union is not the same thing as SparkContext#union.
>>
>> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <y...@yang-cs.com> wrote:
>>
>>> Hi Noorul,
>>>
>>> Thank you for your suggestion. I tried that, but ran out of memory. I
>>> did some searching and found suggestions that we should avoid rdd.union
>>> (http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark).
>>> I will try to come up with some other way.
>>>
>>> Thank you,
>>> Yang
>>>
>>> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <noo...@noorul.com> wrote:
>>>
>>>> sparkx <y...@yang-cs.com> writes:
>>>>
>>>> > Hi,
>>>> >
>>>> > I have a Spark job and a dataset of 0.5 million items. Each item
>>>> > performs some sort of computation (joining a shared external dataset,
>>>> > if that matters) and produces an RDD containing 20-500 result items.
>>>> > Now I would like to combine all these RDDs and perform the next job.
>>>> > What I have found is that the computation itself is quite fast, but
>>>> > combining the RDDs takes much longer.
>>>> >
>>>> > val result = data   // 0.5M data items
>>>> >   .map(compute(_))  // Produces an RDD - fast
>>>> >   .reduce(_ ++ _)   // Combining RDDs - slow
>>>> >
>>>> > I have also tried to collect the results from compute(_) and use a
>>>> > flatMap, but that is also slow.
>>>> >
>>>> > Is there a way to do this efficiently? I'm thinking about writing the
>>>> > result to HDFS and reading it back for the next job, but I am not
>>>> > sure whether that is a preferred way in Spark.
>>>>
>>>> Are you looking for SparkContext.union() [1]?
>>>>
>>>> It does not perform well with the Spark Cassandra connector, though, so
>>>> I am not sure whether it will help you.
>>>>
>>>> Thanks and Regards
>>>> Noorul
>>>>
>>>> [1] http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>>
>>> --
>>> Yang Chen
>>> Dept. of CISE, University of Florida
>>> Mail: y...@yang-cs.com
>>> Web: www.cise.ufl.edu/~yang
>
> --
> Yang Chen
> Dept. of CISE, University of Florida
> Mail: y...@yang-cs.com
> Web: www.cise.ufl.edu/~yang
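P.S. On the RDD#union vs. SparkContext#union point Mark makes above: "_ ++ _" calls RDD#union, which combines two RDDs at a time, so reducing over 0.5M RDDs builds a deeply nested chain of UnionRDDs with a very long lineage. SparkContext#union takes the whole sequence in one call and builds a single flat UnionRDD. If you ever are stuck with many RDDs, the flat form is the lesser evil. A sketch, where rdds is a hypothetical Seq[RDD[String]]:

// Pairwise fold: each ++ wraps the previous result in another
// two-way UnionRDD, producing a very deep lineage. Slow.
val nested = rdds.reduce(_ ++ _)

// One n-way union: a single flat UnionRDD over all inputs.
val flat = sc.union(rdds)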