Andrew,

Thank you. I'm using mapPartitions(), but as you say, it requires that every
partition fit in memory. This works for now, but it may not always, so I was
wondering about another approach.
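For concreteness, the per-partition sort I mean works something like the sketch below. This is plain Python standing in for Spark: the nested lists play the role of an RDD's partitions, and `sort_partition` is a hypothetical helper of the kind you would pass to rdd.mapPartitions(). Note that sorted() materializes the whole iterator, which is exactly the memory constraint being discussed.

```python
# Sketch only: plain lists stand in for an RDD's partitions. In Spark this
# would be rdd.mapPartitions(sort_partition), which still requires each
# partition (not the whole dataset) to fit in memory.

def sort_partition(iterator):
    # sorted() consumes the entire iterator, so the whole partition is
    # held in memory at once -- the limitation discussed in this thread.
    return iter(sorted(iterator))

partitions = [[3, 1, 2], [9, 7, 8]]
result = [list(sort_partition(iter(p))) for p in partitions]
# Each partition is sorted independently; there is no global order
# across partitions.
```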
Thanks,
Roger

On Thu, Jun 5, 2014 at 5:26 PM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Roger,
>
> You should be able to sort within partitions using the rdd.mapPartitions()
> method, and that shouldn't require holding all data in memory at once. It
> does require holding the entire partition in memory, though. Do you need
> the partition to never be held in memory all at once?
>
> As for the work that Aaron mentioned, I think he might be referring to the
> discussion and code surrounding
> https://issues.apache.org/jira/browse/SPARK-983
>
> Cheers!
> Andrew
>
>
> On Thu, Jun 5, 2014 at 5:16 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>
>> I think it would be very handy to be able to specify that you want
>> sorting during a partitioning stage.
>>
>>
>> On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>>
>>> Hi Aaron,
>>>
>>> When you say that sorting is being worked on, can you elaborate a
>>> little more, please?
>>>
>>> In particular, I want to sort the items within each partition (not
>>> globally) without necessarily bringing them all into memory at once.
>>>
>>> Thanks,
>>>
>>> Roger
>>>
>>>
>>> On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>>
>>>> There is no fundamental issue if you're running on data that is larger
>>>> than the cluster's memory size. Many operations can stream data through,
>>>> and thus memory usage is independent of input data size. Certain
>>>> operations require an entire *partition* (not dataset) to fit in memory,
>>>> but there are not many instances of this left (sorting comes to mind,
>>>> and this is being worked on).
>>>>
>>>> In general, one problem with Spark today is that you *can* OOM under
>>>> certain configurations, and it's possible you'll need to change from
>>>> the default configuration if you're doing very memory-intensive jobs.
>>>> However, there are very few cases where Spark would simply fail as a
>>>> matter of course -- for instance, you can always increase the number of
>>>> partitions to decrease the size of any given one, or repartition data
>>>> to eliminate skew.
>>>>
>>>> Regarding the impact on performance, as Mayur said, there may
>>>> absolutely be an impact depending on your jobs. If you're doing a join
>>>> on a very large amount of data with few partitions, then we'll have to
>>>> spill to disk. If you can't cache your working set of data in memory,
>>>> you will also see a performance degradation. Spark enables the use of
>>>> memory to make things fast, but if you just don't have enough memory,
>>>> it won't be terribly fast.
>>>>
>>>>
>>>> On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>
>>>>> Clearly there will be an impact on performance, but frankly it
>>>>> depends on what you are trying to achieve with the dataset.
>>>>>
>>>>> Mayur Rustagi
>>>>> Ph: +1 (760) 203 3257
>>>>> http://www.sigmoidanalytics.com
>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga <vibhorba...@gmail.com> wrote:
>>>>>
>>>>>> Some inputs would be really helpful.
>>>>>>
>>>>>> Thanks,
>>>>>> -Vibhor
>>>>>>
>>>>>>
>>>>>> On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga <vibhorba...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am planning to use Spark with HBase, where I generate an RDD by
>>>>>>> reading data from an HBase table.
>>>>>>>
>>>>>>> I want to know: in the case when the size of the HBase table grows
>>>>>>> larger than the size of the RAM available in the cluster, will the
>>>>>>> application fail, or will there be an impact on performance?
>>>>>>>
>>>>>>> Any thoughts in this direction will be helpful and are welcome.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Vibhor
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Vibhor Banga
>>>>>> Software Development Engineer
>>>>>> Flipkart Internet Pvt. Ltd., Bangalore
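Aaron's suggestion earlier in the thread -- increasing the number of partitions to shrink each individual one -- can be illustrated with a plain-Python sketch. The `repartition` helper below is a hypothetical stand-in for rdd.repartition(n) with hash partitioning, not Spark's actual implementation; it only demonstrates that more partitions means a smaller largest partition.

```python
# Sketch: hash-partition a list of records into num_partitions buckets,
# roughly the way Spark's HashPartitioner assigns keys to partitions.
# (Illustrative stand-in for rdd.repartition(n), not Spark code.)

def repartition(records, num_partitions):
    buckets = [[] for _ in range(num_partitions)]
    for r in records:
        buckets[hash(r) % num_partitions].append(r)
    return buckets

records = list(range(1000))
few = repartition(records, 4)
many = repartition(records, 40)

# More partitions => each one is smaller, so any per-partition
# operation (like an in-memory sort) needs less memory at once.
assert max(len(b) for b in many) < max(len(b) for b in few)
```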