Andrew,

Thank you. I'm using mapPartitions(), but as you say, it requires that every
partition fit in memory. This works for now, but it may not always, so I was
wondering about another approach.
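For concreteness, the per-partition sort I mean works something like the sketch below. This is plain Python standing in for Spark: the nested lists play the role of an RDD's partitions, and `sort_partition` is a hypothetical helper of the kind you would pass to rdd.mapPartitions(). Note that sorted() materializes the whole iterator, which is exactly the memory constraint being discussed.

```python
# Sketch only: plain lists stand in for an RDD's partitions. In Spark this
# would be rdd.mapPartitions(sort_partition), which still requires each
# partition (not the whole dataset) to fit in memory.

def sort_partition(iterator):
    # sorted() consumes the entire iterator, so the whole partition is
    # held in memory at once -- the limitation discussed in this thread.
    return iter(sorted(iterator))

partitions = [[3, 1, 2], [9, 7, 8]]
result = [list(sort_partition(iter(p))) for p in partitions]
# Each partition is sorted independently; there is no global order
# across partitions.
```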
Thanks,
Roger

On Thu, Jun 5, 2014 at 5:26 PM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Roger,
>
> You should be able to sort within partitions using the rdd.mapPartitions()
> method, and that shouldn't require holding all data in memory at once. It
> does require holding the entire partition in memory, though. Do you need
> the partition to never be held in memory all at once?
>
> As for the work that Aaron mentioned, I think he might be referring to the
> discussion and code surrounding
> https://issues.apache.org/jira/browse/SPARK-983
>
> Cheers!
> Andrew
>
>
> On Thu, Jun 5, 2014 at 5:16 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>
>> I think it would be very handy to be able to specify that you want
>> sorting during a partitioning stage.
>>
>>
>> On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>>
>>> Hi Aaron,
>>>
>>> When you say that sorting is being worked on, can you elaborate a
>>> little more, please?
>>>
>>> In particular, I want to sort the items within each partition (not
>>> globally) without necessarily bringing them all into memory at once.
>>>
>>> Thanks,
>>>
>>> Roger
>>>
>>>
>>> On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>>
>>>> There is no fundamental issue if you're running on data that is larger
>>>> than the cluster's memory size. Many operations can stream data through,
>>>> and thus memory usage is independent of input data size. Certain
>>>> operations require an entire *partition* (not dataset) to fit in memory,
>>>> but there are not many instances of this left (sorting comes to mind,
>>>> and this is being worked on).
>>>>
>>>> In general, one problem with Spark today is that you *can* OOM under
>>>> certain configurations, and it's possible you'll need to change from
>>>> the default configuration if you're doing very memory-intensive jobs.
>>>> However, there are very few cases where Spark would simply fail as a
>>>> matter of course -- for instance, you can always increase the number of
>>>> partitions to decrease the size of any given one, or repartition data
>>>> to eliminate skew.
>>>>
>>>> Regarding the impact on performance, as Mayur said, there may
>>>> absolutely be an impact depending on your jobs. If you're doing a join
>>>> on a very large amount of data with few partitions, then we'll have to
>>>> spill to disk. If you can't cache your working set of data in memory,
>>>> you will also see a performance degradation. Spark enables the use of
>>>> memory to make things fast, but if you just don't have enough memory,
>>>> it won't be terribly fast.
>>>>
>>>>
>>>> On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>
>>>>> Clearly there will be an impact on performance, but frankly it
>>>>> depends on what you are trying to achieve with the dataset.
>>>>>
>>>>> Mayur Rustagi
>>>>> Ph: +1 (760) 203 3257
>>>>> http://www.sigmoidanalytics.com
>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga <vibhorba...@gmail.com> wrote:
>>>>>
>>>>>> Some inputs would be really helpful.
>>>>>>
>>>>>> Thanks,
>>>>>> -Vibhor
>>>>>>
>>>>>>
>>>>>> On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga <vibhorba...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am planning to use Spark with HBase, where I generate an RDD by
>>>>>>> reading data from an HBase table.
>>>>>>>
>>>>>>> I want to know: in the case when the size of the HBase table grows
>>>>>>> larger than the size of the RAM available in the cluster, will the
>>>>>>> application fail, or will there be an impact on performance?
>>>>>>>
>>>>>>> Any thoughts in this direction will be helpful and are welcome.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Vibhor
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Vibhor Banga
>>>>>> Software Development Engineer
>>>>>> Flipkart Internet Pvt. Ltd., Bangalore
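Aaron's suggestion earlier in the thread -- increasing the number of partitions to shrink each individual one -- can be illustrated with a plain-Python sketch. The `repartition` helper below is a hypothetical stand-in for rdd.repartition(n) with hash partitioning, not Spark's actual implementation; it only demonstrates that more partitions means a smaller largest partition.

```python
# Sketch: hash-partition a list of records into num_partitions buckets,
# roughly the way Spark's HashPartitioner assigns keys to partitions.
# (Illustrative stand-in for rdd.repartition(n), not Spark code.)

def repartition(records, num_partitions):
    buckets = [[] for _ in range(num_partitions)]
    for r in records:
        buckets[hash(r) % num_partitions].append(r)
    return buckets

records = list(range(1000))
few = repartition(records, 4)
many = repartition(records, 40)

# More partitions => each one is smaller, so any per-partition
# operation (like an in-memory sort) needs less memory at once.
assert max(len(b) for b in many) < max(len(b) for b in few)
```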