Dataset will have access to some of the Catalyst/Tungsten optimizations while also giving you Scala and types. However, it is currently experimental and not yet as efficient as it could be.
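For what it's worth, a minimal sketch of what that looks like on the Spark 1.6 experimental Dataset API; the Sale case class and its fields are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical record type; any case class with an Encoder works.
    case class Sale(region: String, amount: Double)

    val sc = new SparkContext(
      new SparkConf().setAppName("dataset-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // toDS() gives a typed Dataset[Sale]: the lambdas below are checked at
    // compile time, while the physical plan still goes through Catalyst and
    // can use Tungsten's binary row format via the Sale encoder.
    val sales = Seq(Sale("west", 10.0), Sale("east", 5.0)).toDS()
    val bigWest = sales.filter(s => s.region == "west" && s.amount > 1.0)
    bigWest.show()

The point is only that the typed, Scala-style API and the Catalyst/Tungsten plan are not mutually exclusive.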
On Feb 2, 2016 7:50 PM, "Nirav Patel" <npa...@xactlycorp.com> wrote:

> Sure, having a common distributed query and compute engine for every kind of data source is an alluring concept to market and advertise and to attract potential customers (non-engineers, analysts, data scientists). But it's nothing new; it's old school, taking bits and pieces from existing SQL and NoSQL technology, and it lacks much of the polish of a robust SQL engine. I think what sets Spark apart from everything else on the market is the RDD, and the flexibility and Scala-like programming style it gives developers, which is simply much more attractive to write than SQL syntax, schemas, and string constants that fall apart left and right. Writing SQL is old school, period. Good luck making money, though :)
>
> On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> To have a product Databricks can charge for, their SQL engine needs to be competitive. That's why they have these optimizations in Catalyst. RDD is simply no longer the focus.
>>
>> On Feb 2, 2016 7:17 PM, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>>
>>> So the latest optimizations in the Spark 1.4 and 1.5 releases are mostly from project Tungsten. The docs say it uses sun.misc.Unsafe to convert the physical RDD structure into byte arrays at some point, for optimized GC and memory use. My question is: why is this only applicable to SQL/DataFrames and not to RDDs? RDDs have types too!
>>>
>>> On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>
>>>> I haven't gone through many details of the Spark Catalyst optimizer and the Tungsten project, but we have been advised by Databricks support to use DataFrames to resolve OOM errors we are getting during join and groupBy operations. We use Spark 1.3.1, and it looks like it cannot perform an external sort and blows up with OOM.
>>>>
>>>> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>>>>
>>>> Now it's great that this has been addressed in the Spark 1.5 release, but why is Databricks advocating a switch to DataFrames? It may make sense for batch jobs or near-real-time jobs, but I'm not sure it does when you are developing real-time analytics where you want to optimize every millisecond you can. Again, I am still educating myself on the DataFrame APIs and optimizations, and I will benchmark them against RDDs for our batch and real-time use cases as well.
>>>>
>>>> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>
>>>>> What do you think is preventing you from optimizing your own RDD-level transformations and actions? AFAIK, nothing that has been added in Catalyst precludes you from doing that. The fact of the matter is, though, that there is less type and semantic information available to Spark from the raw RDD API than from Spark SQL, DataFrames, or Datasets. That means that Spark itself can't optimize raw RDDs the same way it can optimize higher-level constructs that leverage Catalyst; but if you want to write your own optimizations based on your own knowledge of the data types and semantics hiding in your raw RDDs, there's no reason you can't do that.
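A concrete, if simplified, example of the kind of hand-rolled RDD-level optimization being described above; the pair data here is made up:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-hand-optimization").setMaster("local[*]"))

    // Made-up (key, value) pairs standing in for whatever lives in your real RDDs.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Naive: groupByKey shuffles every single value before summing, which is
    // exactly the kind of thing that runs out of memory on large or skewed keys.
    val slowTotals = pairs.groupByKey().mapValues(_.sum)

    // Hand-optimized: reduceByKey combines map-side first, so far less data is
    // shuffled. Spark cannot infer this from a raw RDD, but you can apply it
    // because you know your aggregation is associative.
    val fastTotals = pairs.reduceByKey(_ + _)

    fastTotals.collect().foreach(println)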
>>>>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Perhaps I should write a blog post about why Spark is focusing more on making Spark jobs easier to write while hiding the underlying performance-optimization details from seasoned Spark users. It's one thing to provide an abstract framework that does the optimization for you, so that a data scientist or data analyst doesn't have to worry about it, but what about developers who do not want the overhead of SQL, optimizers, and unnecessary abstractions? Application designers who know their data and queries should be able to optimize at the level of RDD transformations and actions. Does Spark provide a way to achieve the same level of optimization using either SQL/Catalyst or raw RDD transformations?
>>>>>>
>>>>>> Thanks
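And for completeness, a sketch of the DataFrame route that Databricks support was pointing at, on Spark 1.5+ with invented column names; the same join and groupBy are expressed declaratively, so Catalyst picks the physical plan and the 1.5 aggregation path can spill externally instead of dying with OOM:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.sum

    val sc = new SparkContext(
      new SparkConf().setAppName("dataframe-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Invented data: per-account amounts and per-region quotas.
    val facts = Seq(("acme", "west", 100.0), ("zed", "east", 40.0))
      .toDF("account", "region", "amount")
    val quotas = Seq(("west", 500.0), ("east", 300.0)).toDF("region", "quota")

    // The same join + groupBy that was failing at the RDD level, written as
    // DataFrame operations so Catalyst chooses the physical plan.
    val byRegion = facts.join(quotas, "region")
      .groupBy("region")
      .agg(sum("amount").as("total"), sum("quota").as("quota"))

    byRegion.show()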