Using RDDs requires some 'low level' optimization techniques, while using 
dataframes / Spark SQL allows you to leverage existing optimizations. 
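
Since RDDs are immutable, the usual way to "update" one across iterations is to 
rebind the variable to the transformed RDD each pass, persisting it and 
checkpointing every few iterations so the lineage does not grow unboundedly. 
Below is a minimal plain-Python sketch of that shape (not the actual Spark API; 
`update` is a hypothetical per-record function, and the Spark equivalents are 
noted in comments):

```python
def update(record):
    # hypothetical per-iteration update: increment the value field
    key, value = record
    return (key, value + 1)

def run_iterations(dataset, n_iterations):
    for i in range(n_iterations):
        # each iteration builds a NEW dataset; the old one is unchanged,
        # mirroring rdd = rdd.map(update) in Spark
        dataset = [update(r) for r in dataset]
        # in Spark you would rdd.persist() here, and rdd.checkpoint()
        # every few iterations to truncate the growing DAG
    return dataset

data = [("a", 0), ("b", 10)]
result = run_iterations(data, 3)
print(result)  # [('a', 3), ('b', 13)]
```

The key point is that nothing is mutated in place: each pass produces a fresh 
dataset, which is cheap in Spark as long as intermediate results are cached 
and lineage is periodically cut.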

If you can share some more of your use case, that would help other people 
provide suggestions. 

Thanks

> On May 6, 2016, at 6:57 PM, HARSH TAKKAR <takkarha...@gmail.com> wrote:
> 
> Hi Ted
> I am aware that RDDs are immutable, but in my use case I need to update the 
> same data set after each iteration.
> 
> Following are the options I was exploring:
> 
> 1. Generating a new RDD in each iteration (it might use a lot of memory).
> 2. Using Hive tables and updating the same table after each iteration.
> 
> Please suggest which of the methods listed above would be good to use, or 
> whether there is a better way to accomplish this.
> 
> 
>> On Fri, 6 May 2016, 7:09 p.m. Ted Yu, <yuzhih...@gmail.com> wrote:
>> Please see the doc at the beginning of RDD class:
>> 
>>  * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. 
>> Represents an immutable,
>>  * partitioned collection of elements that can be operated on in parallel. 
>> This class contains the
>>  * basic operations available on all RDDs, such as `map`, `filter`, and 
>> `persist`. In addition,
>> 
>>> On Fri, May 6, 2016 at 5:25 AM, HARSH TAKKAR <takkarha...@gmail.com> wrote:
>>> Hi 
>>> 
>>> Is there a way I can modify an RDD in a for-each loop?
>>> 
>>> Basically, I have a use case in which I need to perform multiple iterations 
>>> over data and modify a few values in each iteration.
>>> 
>>> 
>>> Please help.
