Hi Nirav,
I'm sure you have read this?
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

There is a benchmark in the article showing that the DataFrame
implementation "can" outperform the RDD implementation by a factor of two.
Of course, benchmarks can be "made". But from the code snippet you wrote,
I "think" the DataFrame will choose between different join implementations
based on the data statistics.
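For instance (a sketch only -- df1, df2 and the "id" column are hypothetical,
assuming two DataFrames built from RDDs of case classes), you can ask the
Catalyst planner which physical join it picked:

```
// Sketch: assumes an existing SQLContext and two hypothetical DataFrames
// df1 and df2, e.g. created from RDDs of case classes via toDF().
val joined = df1.join(df2, df1("id") === df2("id"), "left_outer")

// explain() prints the physical plan. Depending on the size/statistics of
// the inputs, Catalyst may choose a BroadcastHashJoin (one side small
// enough to broadcast) or a SortMergeJoin, without any change to this code.
joined.explain()
```

An equivalent RDD join is hard-wired to one shuffle-based implementation,
which is why the planner can come out ahead here.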

I cannot comment on the beauty of it because "beauty is in the eye of the
beholder" LOL
Regarding the comment about it being error prone, can you say why you think
that is the case? Relative to which other approaches?

Best Regards,

Jerry


On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <npa...@xactlycorp.com> wrote:

> I don't understand why one thinks an RDD of case objects doesn't have
> types (a schema). If Spark can convert an RDD to a DataFrame, that means it
> understood the schema. So from that point on, why does one have to use SQL
> features to do further processing? If all Spark needs for its optimizations
> is the schema, then what do these additional SQL features buy? If there is a
> way to avoid the SQL features while using DataFrames, I don't mind it. But it
> looks like I have to convert all my existing transformations to things like
> df1.join(df2, df1("abc") === df2("abc"), "left_outer") .. that's plain ugly
> and error prone in my opinion.
>
> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Michael,
>>
>> Is there a section in the Spark documentation that demonstrates how to
>> serialize arbitrary objects in a DataFrame? The last time I did it, I used
>> a User Defined Type (copied from VectorUDT).
>>
>> Best Regards,
>>
>> Jerry
>>
>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> A principal difference between RDDs and DataFrames/Datasets is that the
>>>> latter have a schema associated with them. This means that they support only
>>>> certain types (primitives, case classes, and more) and that they are
>>>> uniform, whereas RDDs can contain any serializable object and need not
>>>> be uniform. These properties make it possible to generate very
>>>> efficient serialization and other optimizations that cannot be achieved
>>>> with plain RDDs.
>>>>
>>>
>>> You can use Encoders.kryo() as well to serialize arbitrary objects, just
>>> like with RDDs.
>>>
>>
>>
>
>
>
