Awesome! I just read the design docs. That is EXACTLY what I was talking about! Looking forward to it!
Thanks

On Wed, Feb 3, 2016 at 7:40 AM, Koert Kuipers <ko...@tresata.com> wrote:

> yeah there was some discussion about adding them to RDD, but it would
> break a lot. so Dataset was born.
>
> yes it seems Dataset will be the new RDD for most use cases. but i don't
> think it's there yet. just keep an eye out on SPARK-9999 for updates...
>
> On Wed, Feb 3, 2016 at 8:51 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Nirav,
>>
>> I don't know why those optimizations are not implemented in RDD. It is
>> either a political choice or a practical one (backward compatibility
>> might be difficult if they need to introduce these types of
>> optimization into RDD). I think this is the reason Spark now has
>> Dataset. My understanding is that Dataset is the new RDD.
>>
>> Best Regards,
>>
>> Jerry
>>
>> Sent from my iPhone
>>
>> On 3 Feb, 2016, at 12:26 am, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> with respect to joins, unfortunately not all implementations are
>> available. for example i would like to use joins where one side is
>> streamed (and the other cached). this seems to be available for
>> DataFrame but not for RDD.
>>
>> On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel <npa...@xactlycorp.com>
>> wrote:
>>
>>> Hi Jerry,
>>>
>>> Yes, I read that benchmark, and it doesn't help in most cases. I'll
>>> give you an example from one of our applications. It's a memory hogger
>>> by nature, since it works on groupByKey and performs combinatorics on
>>> the Iterator, so it maintains a few structures inside each task. It
>>> runs on MapReduce with half the resources I am giving it on Spark, and
>>> Spark keeps throwing OOM on a pre-step which is a simple join! I saw
>>> every task finish at PROCESS_LOCAL locality, yet the join keeps
>>> failing because the container gets killed, and the container gets
>>> killed due to OOM. We have had a case open with Databricks/MapR on
>>> that for more than a month. Anyway, I don't want to get sidetracked
>>> here. I can believe that changing to DataFrame and its computing model
>>> can bring performance gains, but I was hoping that wouldn't be your
>>> answer to every performance problem.
>>>
>>> Let me ask this: if I decide to stick with RDDs, do I still have the
>>> flexibility to choose which join implementation I use, and similar
>>> underlying constructs to best execute my jobs?
>>>
>>> I said error prone because you need to write column qualifiers instead
>>> of referencing fields, i.e. caseObj("field1") instead of
>>> caseObj.field1; moreover, multiple tables having similar column names
>>> causes parsing issues; and when you start writing constants for your
>>> columns it just becomes another schema to maintain inside your app. It
>>> feels like a thing of the past. A query engine (distributed or not) is
>>> old school as I 'see' it :)
>>>
>>> Thanks for being patient.
>>> Nirav
>>>
>>> On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam <chiling...@gmail.com>
>>> wrote:
>>>
>>>> Hi Nirav,
>>>> I'm sure you read this?
>>>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>>>
>>>> There is a benchmark in the article showing that DataFrames "can"
>>>> outperform an RDD implementation by 2 times. Of course, benchmarks
>>>> can be "made". But from the code snippet you wrote, I "think" the
>>>> DataFrame will choose between different join implementations based on
>>>> the data statistics.
>>>>
>>>> I cannot comment on the beauty of it because "beauty is in the eye of
>>>> the beholder" LOL
>>>>
>>>> Regarding the comment on being error prone, can you say why you think
>>>> that is the case? Relative to what other approaches?
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
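
[Editor's note: a minimal spark-shell sketch of the join selection Jerry
describes, assuming Spark 1.6-era APIs; the table and column names here
are hypothetical. Catalyst picks a physical join strategy from size
estimates, and broadcast() is an explicit hint:

    import org.apache.spark.sql.functions.broadcast
    import sqlContext.implicits._

    val large = sc.parallelize(1 to 1000000).map(i => (i % 1000, i)).toDF("abc", "value")
    val small = sc.parallelize(0 until 1000).map(i => (i, s"name-$i")).toDF("abc", "name")

    // broadcast() hints Catalyst to ship the small side to every executor
    // and use a broadcast hash join instead of a shuffle-based join.
    val joined = large.join(broadcast(small), large("abc") === small("abc"), "left_outer")
    joined.explain()  // should show BroadcastHashJoin in the physical plan

Plain RDDs have no such per-join hint; the usual workaround there is to
collect the small side into a broadcast variable and map-side join by hand.]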
>>>> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <npa...@xactlycorp.com>
>>>> wrote:
>>>>
>>>>> I don't understand why one thinks an RDD of case objects doesn't
>>>>> have types (a schema). If Spark can convert an RDD to a DataFrame,
>>>>> that means it understood the schema. So from that point on, why does
>>>>> one have to use SQL features to do further processing? If all Spark
>>>>> needs for its optimizations is the schema, what do these additional
>>>>> SQL features buy? If there is a way to avoid the SQL features while
>>>>> using DataFrames, I don't mind it. But it looks like I have to
>>>>> convert all my existing transformations to things like
>>>>> df1.join(df2, df1("abc") === df2("abc"), "left_outer") ... that's
>>>>> plain ugly and error prone in my opinion.
>>>>>
>>>>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <chiling...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Is there a section in the Spark documentation that demonstrates how
>>>>>> to serialize arbitrary objects in a DataFrame? The last time I did
>>>>>> it, I used a User Defined Type (copied from VectorUDT).
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Jerry
>>>>>>
>>>>>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <
>>>>>> mich...@databricks.com> wrote:
>>>>>>
>>>>>>> A principal difference between RDDs and DataFrames/Datasets is
>>>>>>>> that the latter have a schema associated with them. This means
>>>>>>>> that they support only certain types (primitives, case classes
>>>>>>>> and more) and that they are uniform, whereas RDDs can contain any
>>>>>>>> serializable object and need not be uniform. These properties
>>>>>>>> make it possible to generate very efficient serialization and
>>>>>>>> other optimizations that cannot be achieved with plain RDDs.
>>>>>>>
>>>>>>> You can use Encoders.kryo() as well to serialize arbitrary
>>>>>>> objects, just like with RDDs.
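
[Editor's note: a minimal spark-shell sketch of the Encoders.kryo()
approach Michael mentions, assuming Spark 1.6-era APIs; the Legacy class
is a hypothetical stand-in for an arbitrary object:

    import org.apache.spark.sql.{Encoder, Encoders}

    // An arbitrary class with no schema-friendly structure (not a case class).
    class Legacy(val payload: String) extends Serializable

    // Encoders.kryo yields a binary encoder, so a Dataset can hold such
    // objects much like an RDD does, at the cost of an opaque binary
    // column that Catalyst cannot optimize through.
    implicit val legacyEnc: Encoder[Legacy] = Encoders.kryo[Legacy]

    val ds = sqlContext.createDataset(Seq(new Legacy("a"), new Legacy("b")))
    ds.count()  // the objects round-trip through kryo-serialized binary rows

For schema-friendly types such as case classes, the encoders provided by
import sqlContext.implicits._ suffice, and no kryo fallback is needed.]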