Dataset will have access to some of the Catalyst/Tungsten optimizations while also giving you Scala and types. However, it is currently experimental and not yet as efficient as it could be.
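For what it's worth, a minimal sketch of what that looks like on the Spark 1.6 experimental Dataset API; the Sale case class and its fields are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical record type; any case class with an Encoder works.
    case class Sale(region: String, amount: Double)

    val sc = new SparkContext(
      new SparkConf().setAppName("dataset-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // toDS() gives a typed Dataset[Sale]: the lambdas below are checked at
    // compile time, while the physical plan still goes through Catalyst and
    // can use Tungsten's binary row format via the Sale encoder.
    val sales = Seq(Sale("west", 10.0), Sale("east", 5.0)).toDS()
    val bigWest = sales.filter(s => s.region == "west" && s.amount > 1.0)
    bigWest.show()

The point is only that the typed, Scala-style API and the Catalyst/Tungsten plan are not mutually exclusive.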
On Feb 2, 2016 7:50 PM, "Nirav Patel" <npa...@xactlycorp.com> wrote:

> Sure, having a common distributed query and compute engine for every kind of data source is an alluring concept to market and advertise and to attract potential customers (non-engineers, analysts, data scientists). But it's nothing new; it's old school, taking bits and pieces from existing SQL and NoSQL technology, and it lacks much of the polish of a robust SQL engine. I think what sets Spark apart from everything else on the market is the RDD, and the flexibility and Scala-like programming style it gives developers, which is simply much more attractive to write than SQL syntax, schemas, and string constants that fall apart left and right. Writing SQL is old school, period. Good luck making money, though :)
>
> On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> To have a product Databricks can charge for, their SQL engine needs to be competitive. That's why they have these optimizations in Catalyst. RDD is simply no longer the focus.
>>
>> On Feb 2, 2016 7:17 PM, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>>
>>> So the latest optimizations in the Spark 1.4 and 1.5 releases are mostly from project Tungsten. The docs say it uses sun.misc.Unsafe to convert the physical RDD structure into byte arrays at some point, for optimized GC and memory use. My question is: why is this only applicable to SQL/DataFrames and not to RDDs? RDDs have types too!
>>>
>>> On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>
>>>> I haven't gone through many details of the Spark Catalyst optimizer and the Tungsten project, but we have been advised by Databricks support to use DataFrames to resolve OOM errors we are getting during join and groupBy operations. We use Spark 1.3.1, and it looks like it cannot perform an external sort and blows up with OOM.
>>>>
>>>> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>>>>
>>>> Now it's great that this has been addressed in the Spark 1.5 release, but why is Databricks advocating a switch to DataFrames? It may make sense for batch jobs or near-real-time jobs, but I'm not sure it does when you are developing real-time analytics where you want to optimize every millisecond you can. Again, I am still educating myself on the DataFrame APIs and optimizations, and I will benchmark them against RDDs for our batch and real-time use cases as well.
>>>>
>>>> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>
>>>>> What do you think is preventing you from optimizing your own RDD-level transformations and actions? AFAIK, nothing that has been added in Catalyst precludes you from doing that. The fact of the matter is, though, that there is less type and semantic information available to Spark from the raw RDD API than from Spark SQL, DataFrames, or Datasets. That means that Spark itself can't optimize raw RDDs the same way it can optimize higher-level constructs that leverage Catalyst; but if you want to write your own optimizations based on your own knowledge of the data types and semantics hiding in your raw RDDs, there's no reason you can't do that.
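A concrete, if simplified, example of the kind of hand-rolled RDD-level optimization being described above; the pair data here is made up:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-hand-optimization").setMaster("local[*]"))

    // Made-up (key, value) pairs standing in for whatever lives in your real RDDs.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Naive: groupByKey shuffles every single value before summing, which is
    // exactly the kind of thing that runs out of memory on large or skewed keys.
    val slowTotals = pairs.groupByKey().mapValues(_.sum)

    // Hand-optimized: reduceByKey combines map-side first, so far less data is
    // shuffled. Spark cannot infer this from a raw RDD, but you can apply it
    // because you know your aggregation is associative.
    val fastTotals = pairs.reduceByKey(_ + _)

    fastTotals.collect().foreach(println)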
>>>>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Perhaps I should write a blog post about why Spark is focusing more on making Spark jobs easier to write while hiding the underlying performance-optimization details from seasoned Spark users. It's one thing to provide an abstract framework that does the optimization for you, so that a data scientist or data analyst doesn't have to worry about it, but what about developers who do not want the overhead of SQL, optimizers, and unnecessary abstractions? Application designers who know their data and queries should be able to optimize at the level of RDD transformations and actions. Does Spark provide a way to achieve the same level of optimization using either SQL/Catalyst or raw RDD transformations?
>>>>>>
>>>>>> Thanks
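And for completeness, a sketch of the DataFrame route that Databricks support was pointing at, on Spark 1.5+ with invented column names; the same join and groupBy are expressed declaratively, so Catalyst picks the physical plan and the 1.5 aggregation path can spill externally instead of dying with OOM:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.sum

    val sc = new SparkContext(
      new SparkConf().setAppName("dataframe-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Invented data: per-account amounts and per-region quotas.
    val facts = Seq(("acme", "west", 100.0), ("zed", "east", 40.0))
      .toDF("account", "region", "amount")
    val quotas = Seq(("west", 500.0), ("east", 300.0)).toDF("region", "quota")

    // The same join + groupBy that was failing at the RDD level, written as
    // DataFrame operations so Catalyst chooses the physical plan.
    val byRegion = facts.join(quotas, "region")
      .groupBy("region")
      .agg(sum("amount").as("total"), sum("quota").as("quota"))

    byRegion.show()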