Hi Nirav,

I don't know why those optimizations are not implemented for RDDs. It is
either a political choice or a practical one (backward compatibility might be
difficult if they need to introduce these types of optimizations into the RDD
API). I think this is the reason Spark now has Dataset. My understanding is
that Dataset is the new RDD.
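For example, something like this (an untested sketch against the Spark 1.6
Dataset API; the Order case class, the data, and an existing SparkContext
sc are just placeholders of mine):

    import org.apache.spark.sql.SQLContext

    case class Order(customer: String, amount: Double)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A Dataset is typed like an RDD of case objects...
    val orders = sqlContext.createDataset(Seq(
      Order("a", 10.0), Order("b", 20.0)))

    // ...so transformations take ordinary Scala functions, while the
    // optimizer still sees the schema through the encoder.
    val big = orders.filter(o => o.amount > 15.0).map(_.customer)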
Best Regards,

Jerry

Sent from my iPhone

> On 3 Feb, 2016, at 12:26 am, Koert Kuipers <ko...@tresata.com> wrote:
>
> With respect to joins, unfortunately not all implementations are available.
> For example, I would like to use joins where one side is streaming (and the
> other cached). This seems to be available for DataFrame but not for RDD.
>
>> On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>> Hi Jerry,
>>
>> Yes, I read that benchmark, and it doesn't help in most cases. I'll give
>> you an example from one of our applications. It's a memory hogger by nature
>> since it works on groupByKey and performs combinatorics on the Iterator, so
>> it maintains a few structures inside each task. It runs on MapReduce with
>> half the resources I am giving it on Spark, and Spark keeps throwing OOM on
>> a pre-step which is a simple join! I saw every task run at PROCESS_LOCAL
>> locality, yet the join keeps failing because the container is killed, and
>> the container is killed due to OOM. We have had a case open with
>> Databricks/MapR on that for more than a month. Anyway, I don't want to
>> digress. I can believe that changing to DataFrame and its computing model
>> can bring performance gains, but I was hoping that wouldn't be your answer
>> to every performance problem.
>>
>> Let me ask this: if I decide to stick with RDDs, do I still have the
>> flexibility to choose which join implementation to use, and similar
>> underlying constructs to best execute my jobs?
>>
>> I said error prone because you need to write column qualifiers instead of
>> referencing fields, i.e. 'caseObj("field1")' instead of 'caseObj.field1';
>> moreover, multiple tables having similar column names causes parsing
>> issues; and when you start writing constants for your columns it just
>> becomes another schema to maintain inside your app. It feels like a thing
>> of the past. A query engine (distributed or not) is old school as I 'see'
>> it :)
>>
>> Thanks for being patient.
>> Nirav
>>
>>> On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>> Hi Nirav,
>>>
>>> I'm sure you read this?
>>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>>
>>> There is a benchmark in the article showing that DataFrame "can"
>>> outperform an RDD implementation by 2 times. Of course, benchmarks can be
>>> "made". But from the code snippet you wrote, I "think" DataFrame will
>>> choose between different join implementations based on the data
>>> statistics.
>>>
>>> I cannot comment on the beauty of it because "beauty is in the eye of the
>>> beholder" LOL
>>>
>>> Regarding the comment on error prone, can you say why you think that is
>>> the case? Relative to what other ways?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>> I don't understand why one thinks an RDD of case objects doesn't have
>>>> types (a schema). If Spark can convert an RDD to a DataFrame, that means
>>>> it understood the schema. So then, from that point, why does one have to
>>>> use SQL features to do further processing? If all Spark needs for its
>>>> optimizations is a schema, then what do these additional SQL features
>>>> buy? If there is a way to avoid the SQL features while using DataFrame,
>>>> I don't mind it. But it looks like I have to convert all my existing
>>>> transformations to things like
>>>> df1.join(df2, df1("abc") === df2("abc"), "left_outer") ... that's plain
>>>> ugly and error prone in my opinion.
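>>>> To make it concrete, here is roughly the before and after I mean (a
>>>> simplified sketch; the case classes, the sales/accounts RDDs, and their
>>>> DataFrame equivalents are made-up names):
>>>>
>>>>     case class Sale(acctId: Int, amount: Double)
>>>>     case class Account(acctId: Int, name: String)
>>>>
>>>>     // RDD style: fields are referenced by name and checked at
>>>>     // compile time
>>>>     val joinedRdd = sales.keyBy(_.acctId).join(accounts.keyBy(_.acctId))
>>>>
>>>>     // DataFrame style: columns are strings, so a typo in "acctId" or
>>>>     // "left_outer" only shows up at runtime
>>>>     val joinedDf = salesDf.join(accountsDf,
>>>>       salesDf("acctId") === accountsDf("acctId"), "left_outer")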
>>>>
>>>>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>>> Hi Michael,
>>>>>
>>>>> Is there a section in the Spark documentation that demonstrates how to
>>>>> serialize arbitrary objects in a DataFrame? The last time I did it, I
>>>>> used a User Defined Type (copied from VectorUDT).
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Jerry
>>>>>
>>>>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <mich...@databricks.com>
>>>>> wrote:
>>>>>>> A principal difference between RDDs and DataFrames/Datasets is that
>>>>>>> the latter have a schema associated with them. This means that they
>>>>>>> support only certain types (primitives, case classes and more) and
>>>>>>> that they are uniform, whereas RDDs can contain any serializable
>>>>>>> object and need not be uniform. These properties make it possible to
>>>>>>> generate very efficient serialization and other optimizations that
>>>>>>> cannot be achieved with plain RDDs.
>>>>>>
>>>>>> You can use Encoders.kryo() as well to serialize arbitrary objects,
>>>>>> just like with RDDs.
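>>>>>>
>>>>>> Something like this (an untested sketch against the Spark 1.6 API;
>>>>>> MyClass and sqlContext are placeholders):
>>>>>>
>>>>>>     import org.apache.spark.sql.{Encoder, Encoders}
>>>>>>
>>>>>>     class MyClass(val payload: Array[Byte]) extends Serializable
>>>>>>
>>>>>>     // Kryo-based encoder for a class that has no schema-based encoder
>>>>>>     implicit val myClassEncoder: Encoder[MyClass] =
>>>>>>       Encoders.kryo[MyClass]
>>>>>>
>>>>>>     val ds = sqlContext.createDataset(
>>>>>>       Seq(new MyClass(Array[Byte](1, 2, 3))))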