Sure, having a common distributed query and compute engine for all kinds of
data sources is an alluring concept to market and advertise and to attract
potential customers (non-engineers, analysts, data scientists). But it's
nothing new, and frankly it's old school: it takes bits and pieces from
existing SQL and NoSQL technologies, and it lacks much of the polish of a
robust SQL engine. I think what sets Spark apart from everything else on
the market is the RDD, and the flexibility and Scala-like programming style
it gives developers, which is simply much more attractive to write than SQL
syntax, schemas, and string constants that fall apart left and right.
Writing SQL is old school, period.  Good luck making money, though :)
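
To make the contrast concrete, here is a minimal sketch (the Sale case
class and the data are made up for illustration; Spark 1.x APIs):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Sale(region: String, amount: Double)

    object RddVsSql {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("rdd-vs-sql").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val sales = sc.parallelize(
          Seq(Sale("west", 10.0), Sale("east", 5.0), Sale("west", 7.5)))

        // RDD style: plain typed Scala; a misspelled field name is a
        // compile-time error.
        sales.map(s => (s.region, s.amount)).reduceByKey(_ + _)
          .collect().foreach(println)

        // SQL style: schema and column names live in strings, so a typo
        // like "amont" only fails at runtime when the query is analyzed.
        sales.toDF().registerTempTable("sales")
        sqlContext.sql("SELECT region, SUM(amount) FROM sales GROUP BY region")
          .collect().foreach(println)

        sc.stop()
      }
    }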

On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers <ko...@tresata.com> wrote:

> To have a product they can charge for, Databricks' SQL engine needs to be
> competitive. That's why they have these optimizations in Catalyst. RDD is
> simply no longer the focus.
> On Feb 2, 2016 7:17 PM, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>
>> So the latest optimizations in the Spark 1.4 and 1.5 releases mostly come
>> from project Tungsten. The docs say it uses sun.misc.Unsafe to convert the
>> physical RDD structure into byte arrays at some point, for optimized GC
>> and memory use. My question is: why is this only applicable to
>> SQL/DataFrames and not RDDs? RDDs have types too!
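>>
>> As a sketch of what I mean (hypothetical data): a DataFrame carries an
>> explicit schema that Tungsten can map onto a compact binary row layout,
>> while to Spark an RDD is just opaque JVM objects.
>>
>>   val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))  // RDD[(String, Int)]
>>   // Spark only sees serialized Java objects here; it has no schema it
>>   // could use to re-encode them as raw bytes.
>>
>>   val df = sqlContext.createDataFrame(rdd).toDF("key", "value")
>>   df.printSchema()
>>   // The declared schema (key: string, value: int) is what lets Tungsten
>>   // store and compare rows off-heap as byte arrays.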
>>
>>
>> On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel <npa...@xactlycorp.com>
>> wrote:
>>
>>> I haven't gone through much detail of the Spark Catalyst optimizer or
>>> the Tungsten project, but we have been advised by Databricks support to
>>> use DataFrames to resolve the OOM errors we are getting during Join and
>>> GroupBy operations. We use Spark 1.3.1, and it looks like it cannot
>>> perform an external sort and blows up with an OOM:
>>>
>>> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>>>
>>> Now, it's great that this has been addressed in the Spark 1.5 release,
>>> but why is Databricks advocating a switch to DataFrames? It may make
>>> sense for batch or near-real-time jobs, but I'm not sure it does when you
>>> are developing real-time analytics where you want to optimize every
>>> millisecond you can. That said, I am still educating myself on the
>>> DataFrame APIs and optimizations, and I will benchmark them against RDDs
>>> for both our batch and real-time use cases.
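>>>
>>> For reference, the kind of rewrite they suggest looks roughly like this
>>> (a sketch with made-up names, not our actual job):
>>>
>>>   // RDD version: groupByKey materializes every group in memory, which
>>>   // is where the OOM shows up on large or skewed keys.
>>>   val counts = events.map(e => (e.userId, 1)).groupByKey().mapValues(_.sum)
>>>
>>>   // DataFrame version: the same aggregation goes through Catalyst and
>>>   // can aggregate partially and spill, instead of blowing up.
>>>   val countsDF = eventsDF.groupBy("userId").count()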
>>>
>>> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <m...@clearstorydata.com>
>>> wrote:
>>>
>>>> What do you think is preventing you from optimizing your own RDD-level
>>>> transformations and actions?  AFAIK, nothing that has been added in
>>>> Catalyst precludes you from doing that.  The fact of the matter is, though,
>>>> that there is less type and semantic information available to Spark from
>>>> the raw RDD API than from using Spark SQL, DataFrames or DataSets.  That
>>>> means that Spark itself can't optimize for raw RDDs the same way that it
>>>> can for higher-level constructs that can leverage Catalyst; but if you want
>>>> to write your own optimizations based on your own knowledge of the data
>>>> types and semantics that are hiding in your raw RDDs, there's no reason
>>>> that you can't do that.
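>>>>
>>>> For example (a sketch with hypothetical pair RDDs): if you know two
>>>> datasets are joined repeatedly on the same key, you can co-partition
>>>> and cache them yourself, the sort of decision Catalyst makes
>>>> automatically for DataFrames:
>>>>
>>>>   import org.apache.spark.HashPartitioner
>>>>
>>>>   val part = new HashPartitioner(64)
>>>>   val users  = userRdd.partitionBy(part).cache()   // RDD[(Long, User)]
>>>>   val orders = orderRdd.partitionBy(part).cache()  // RDD[(Long, Order)]
>>>>
>>>>   // Both sides now share a partitioner, so repeated joins avoid a
>>>>   // full shuffle.
>>>>   val joined = users.join(orders)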
>>>>
>>>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <npa...@xactlycorp.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Perhaps I should write a blog post about why Spark is focusing more on
>>>>> making Spark jobs easier to write while hiding the underlying
>>>>> performance-optimization details from seasoned Spark users. It's one
>>>>> thing to provide an abstract framework that does the optimization for
>>>>> you, so that as a data scientist or data analyst you don't have to
>>>>> worry about it, but what about developers who do not want the overhead
>>>>> of SQL, optimizers, and unnecessary abstractions? Application designers
>>>>> who know their data and queries should be able to optimize at the level
>>>>> of RDD transformations and actions. Does Spark provide a way to achieve
>>>>> the same level of optimization using either SQL Catalyst or raw RDD
>>>>> transformations?
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>

