If an individual partition becomes too large to fit in memory, then the
usual approach would be to repartition into more partitions so that each
one is smaller. Hopefully it would then fit.
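For illustration, a minimal sketch of that approach (the input path and
partition counts here are made-up placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("repartition-example"))

    // Hypothetical input that arrives in 100 partitions, each too large
    // to fit in executor memory.
    val rdd = sc.textFile("hdfs:///some/large/input", minPartitions = 100)

    // repartition() shuffles the data into more, smaller partitions;
    // going from 100 to 400 roughly quarters the data per partition.
    val smaller = rdd.repartition(400)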
On Jun 6, 2014 5:47 PM, "Roger Hoover" <roger.hoo...@gmail.com> wrote:

> Andrew,
>
> Thank you. I'm using mapPartitions() but, as you say, it requires that
> every partition fit in memory. This will work for now but may not always
> work, so I was wondering about another way.
>
> Thanks,
>
> Roger
>
>
> On Thu, Jun 5, 2014 at 5:26 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> Hi Roger,
>>
>> You should be able to sort within partitions using the
>> rdd.mapPartitions() method, and that shouldn't require holding all data
>> in memory at once. It does require holding the entire partition in
>> memory, though. Do you need the partition to never be held in memory
>> all at once?
>>
>> As for the work Aaron mentioned, I think he might be referring to the
>> discussion and code surrounding
>> https://issues.apache.org/jira/browse/SPARK-983
>>
>> Cheers!
>> Andrew
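(A minimal sketch of the rdd.mapPartitions() approach Andrew describes
above -- the data here is made up, and note that toArray is exactly what
forces each partition to be held in memory at once:)

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("sort-within-partitions"))

    // Hypothetical data spread across 100 partitions.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 100)

    // Sort each partition independently (no global order). toArray
    // materializes one whole partition per task in memory.
    val sortedWithinPartitions = rdd.mapPartitions { iter =>
      iter.toArray.sorted.iterator
    }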
>>
>> On Thu, Jun 5, 2014 at 5:16 PM, Roger Hoover <roger.hoo...@gmail.com>
>> wrote:
>>
>>> I think it would be very handy to be able to specify that you want
>>> sorting during a partitioning stage.
>>>
>>>
>>> On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover <roger.hoo...@gmail.com>
>>> wrote:
>>>
>>>> Hi Aaron,
>>>>
>>>> When you say that sorting is being worked on, can you elaborate a
>>>> little more, please?
>>>>
>>>> In particular, I want to sort the items within each partition (not
>>>> globally) without necessarily bringing them all into memory at once.
>>>>
>>>> Thanks,
>>>>
>>>> Roger
>>>>
>>>>
>>>> On Sat, May 31, 2014 at 11:10 PM, Aaron Davidson <ilike...@gmail.com>
>>>> wrote:
>>>>
>>>>> There is no fundamental issue if you're running on data that is
>>>>> larger than the cluster memory size. Many operations can stream data
>>>>> through, and thus memory usage is independent of input data size.
>>>>> Certain operations require an entire *partition* (not dataset) to fit
>>>>> in memory, but there are not many instances of this left (sorting
>>>>> comes to mind, and this is being worked on).
>>>>>
>>>>> In general, one problem with Spark today is that you *can* OOM under
>>>>> certain configurations, and it's possible you'll need to change from
>>>>> the default configuration if you're doing very memory-intensive jobs.
>>>>> However, there are very few cases where Spark would simply fail as a
>>>>> matter of course -- for instance, you can always increase the number
>>>>> of partitions to decrease the size of any given one, or repartition
>>>>> data to eliminate skew.
>>>>>
>>>>> Regarding impact on performance, as Mayur said, there may absolutely
>>>>> be an impact depending on your jobs. If you're doing a join on a very
>>>>> large amount of data with few partitions, then we'll have to spill to
>>>>> disk. If you can't cache your working set of data in memory, you will
>>>>> also see a performance degradation. Spark enables the use of memory
>>>>> to make things fast, but if you just don't have enough memory, it
>>>>> won't be terribly fast.
>>>>>
>>>>>
>>>>> On Sat, May 31, 2014 at 12:14 AM, Mayur Rustagi <
>>>>> mayur.rust...@gmail.com> wrote:
>>>>>
>>>>>> Clearly there will be an impact on performance, but frankly it
>>>>>> depends on what you are trying to achieve with the dataset.
>>>>>>
>>>>>> Mayur Rustagi
>>>>>> Ph: +1 (760) 203 3257
>>>>>> http://www.sigmoidanalytics.com
>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>
>>>>>>
>>>>>> On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga
>>>>>> <vibhorba...@gmail.com> wrote:
>>>>>>
>>>>>>> Some inputs will be really helpful.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Vibhor
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga
>>>>>>> <vibhorba...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I am planning to use Spark with HBase, where I generate an RDD by
>>>>>>>> reading data from an HBase table.
>>>>>>>>
>>>>>>>> I want to know: in the case when the size of the HBase table grows
>>>>>>>> larger than the RAM available in the cluster, will the application
>>>>>>>> fail, or will there only be an impact on performance?
>>>>>>>>
>>>>>>>> Any thoughts in this direction will be helpful and are welcome.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Vibhor
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Vibhor Banga
>>>>>>> Software Development Engineer
>>>>>>> Flipkart Internet Pvt. Ltd., Bangalore
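(To Vibhor's question: a minimal sketch of the kind of HBase-backed RDD
he describes, using the standard TableInputFormat -- the table name is a
placeholder. Spark creates one RDD partition per HBase region, so the
scan streams through the cluster rather than needing the whole table in
RAM at once:)

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hbase-scan"))

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder

    // One RDD partition per HBase region; rows are read as tasks run,
    // not loaded into cluster memory up front.
    val hbaseRdd = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println("Rows scanned: " + hbaseRdd.count())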