Awesome! I just read the design docs. That is EXACTLY what I was talking about! Looking forward to it!
Thanks

On Wed, Feb 3, 2016 at 7:40 AM, Koert Kuipers <ko...@tresata.com> wrote:

> yeah there was some discussion about adding them to RDD, but it would
> break a lot. so Dataset was born.
>
> yes it seems Dataset will be the new RDD for most use cases. but i don't
> think it's there yet. just keep an eye out on SPARK-9999 for updates...
>
> On Wed, Feb 3, 2016 at 8:51 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Nirav,
>>
>> I don't know why those optimizations are not implemented in RDD. It is
>> either a political choice or a practical one (backward compatibility
>> might be difficult if they need to introduce these types of
>> optimization into RDD). I think this is the reason Spark now has
>> Dataset. My understanding is that Dataset is the new RDD.
>>
>> Best Regards,
>>
>> Jerry
>>
>> Sent from my iPhone
>>
>> On 3 Feb, 2016, at 12:26 am, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> with respect to joins, unfortunately not all implementations are
>> available. for example i would like to use joins where one side is
>> streamed (and the other cached). this seems to be available for
>> DataFrame but not for RDD.
>>
>> On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel <npa...@xactlycorp.com>
>> wrote:
>>
>>> Hi Jerry,
>>>
>>> Yes, I read that benchmark, and it doesn't help in most cases. I'll
>>> give you an example from one of our applications. It's a memory hogger
>>> by nature, since it works on groupByKey and performs combinatorics on
>>> the Iterator, so it maintains a few structures inside each task. It
>>> runs on MapReduce with half the resources I am giving it on Spark, and
>>> Spark keeps throwing OOM on a pre-step which is a simple join! I saw
>>> every task finish at PROCESS_LOCAL locality, yet the join keeps
>>> failing because the container gets killed, and the container gets
>>> killed due to OOM. We have had a case open with Databricks/MapR on
>>> that for more than a month. Anyway, I don't want to get sidetracked
>>> here. I can believe that changing to DataFrame and its computing model
>>> can bring performance gains, but I was hoping that wouldn't be your
>>> answer to every performance problem.
>>>
>>> Let me ask this: if I decide to stick with RDDs, do I still have the
>>> flexibility to choose which join implementation I use, and similar
>>> underlying constructs to best execute my jobs?
>>>
>>> I said error prone because you need to write column qualifiers instead
>>> of referencing fields, i.e. caseObj("field1") instead of
>>> caseObj.field1; moreover, multiple tables having similar column names
>>> causes parsing issues; and when you start writing constants for your
>>> columns it just becomes another schema to maintain inside your app. It
>>> feels like a thing of the past. A query engine (distributed or not) is
>>> old school as I 'see' it :)
>>>
>>> Thanks for being patient.
>>> Nirav
>>>
>>> On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam <chiling...@gmail.com>
>>> wrote:
>>>
>>>> Hi Nirav,
>>>> I'm sure you read this?
>>>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>>>
>>>> There is a benchmark in the article showing that DataFrames "can"
>>>> outperform an RDD implementation by 2 times. Of course, benchmarks
>>>> can be "made". But from the code snippet you wrote, I "think" the
>>>> DataFrame will choose between different join implementations based on
>>>> the data statistics.
>>>>
>>>> I cannot comment on the beauty of it because "beauty is in the eye of
>>>> the beholder" LOL
>>>>
>>>> Regarding the comment on being error prone, can you say why you think
>>>> that is the case? Relative to what other approaches?
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
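
[Editor's note: a minimal spark-shell sketch of the join selection Jerry
describes, assuming Spark 1.6-era APIs; the table and column names here
are hypothetical. Catalyst picks a physical join strategy from size
estimates, and broadcast() is an explicit hint:

    import org.apache.spark.sql.functions.broadcast
    import sqlContext.implicits._

    val large = sc.parallelize(1 to 1000000).map(i => (i % 1000, i)).toDF("abc", "value")
    val small = sc.parallelize(0 until 1000).map(i => (i, s"name-$i")).toDF("abc", "name")

    // broadcast() hints Catalyst to ship the small side to every executor
    // and use a broadcast hash join instead of a shuffle-based join.
    val joined = large.join(broadcast(small), large("abc") === small("abc"), "left_outer")
    joined.explain()  // should show BroadcastHashJoin in the physical plan

Plain RDDs have no such per-join hint; the usual workaround there is to
collect the small side into a broadcast variable and map-side join by hand.]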
>>>> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <npa...@xactlycorp.com>
>>>> wrote:
>>>>
>>>>> I don't understand why one thinks an RDD of case objects doesn't
>>>>> have types (a schema). If Spark can convert an RDD to a DataFrame,
>>>>> that means it understood the schema. So from that point on, why does
>>>>> one have to use SQL features to do further processing? If all Spark
>>>>> needs for its optimizations is the schema, what do these additional
>>>>> SQL features buy? If there is a way to avoid the SQL features while
>>>>> using DataFrames, I don't mind it. But it looks like I have to
>>>>> convert all my existing transformations to things like
>>>>> df1.join(df2, df1("abc") === df2("abc"), "left_outer") ... that's
>>>>> plain ugly and error prone in my opinion.
>>>>>
>>>>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <chiling...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Is there a section in the Spark documentation that demonstrates how
>>>>>> to serialize arbitrary objects in a DataFrame? The last time I did
>>>>>> it, I used a User Defined Type (copied from VectorUDT).
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Jerry
>>>>>>
>>>>>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <
>>>>>> mich...@databricks.com> wrote:
>>>>>>
>>>>>>> A principal difference between RDDs and DataFrames/Datasets is
>>>>>>>> that the latter have a schema associated with them. This means
>>>>>>>> that they support only certain types (primitives, case classes
>>>>>>>> and more) and that they are uniform, whereas RDDs can contain any
>>>>>>>> serializable object and need not be uniform. These properties
>>>>>>>> make it possible to generate very efficient serialization and
>>>>>>>> other optimizations that cannot be achieved with plain RDDs.
>>>>>>>
>>>>>>> You can use Encoders.kryo() as well to serialize arbitrary
>>>>>>> objects, just like with RDDs.
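
[Editor's note: a minimal spark-shell sketch of the Encoders.kryo()
approach Michael mentions, assuming Spark 1.6-era APIs; the Legacy class
is a hypothetical stand-in for an arbitrary object:

    import org.apache.spark.sql.{Encoder, Encoders}

    // An arbitrary class with no schema-friendly structure (not a case class).
    class Legacy(val payload: String) extends Serializable

    // Encoders.kryo yields a binary encoder, so a Dataset can hold such
    // objects much like an RDD does, at the cost of an opaque binary
    // column that Catalyst cannot optimize through.
    implicit val legacyEnc: Encoder[Legacy] = Encoders.kryo[Legacy]

    val ds = sqlContext.createDataset(Seq(new Legacy("a"), new Legacy("b")))
    ds.count()  // the objects round-trip through kryo-serialized binary rows

For schema-friendly types such as case classes, the encoders provided by
import sqlContext.implicits._ suffice, and no kryo fallback is needed.]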