Using RDDs requires some "low level" optimization techniques, while using DataFrames / Spark SQL lets you leverage the engine's existing optimized code.
If you can share some more of your use case, that would help other people provide suggestions. Thanks

> On May 6, 2016, at 6:57 PM, HARSH TAKKAR <takkarha...@gmail.com> wrote:
>
> Hi Ted
>
> I am aware that RDDs are immutable, but in my use case I need to update the
> same dataset after each iteration.
>
> These are the options I have been exploring:
>
> 1. Generating a new RDD in each iteration (it might use a lot of memory).
> 2. Using Hive tables and updating the same table after each iteration.
>
> Please suggest which of the methods listed above would be good to use, or
> whether there are better ways to accomplish it.
>
>> On Fri, 6 May 2016, 7:09 p.m. Ted Yu, <yuzhih...@gmail.com> wrote:
>> Please see the doc at the beginning of the RDD class:
>>
>> * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
>> Represents an immutable,
>> * partitioned collection of elements that can be operated on in parallel.
>> This class contains the
>> * basic operations available on all RDDs, such as `map`, `filter`, and
>> `persist`. In addition,
>>
>>> On Fri, May 6, 2016 at 5:25 AM, HARSH TAKKAR <takkarha...@gmail.com> wrote:
>>> Hi
>>>
>>> Is there a way I can modify an RDD in a for-each loop?
>>>
>>> Basically, I have a use case in which I need to perform multiple
>>> iterations over the data and modify a few values in each iteration.
>>>
>>> Please help.