Hi Jerry,

Yes, I read that benchmark, and it doesn't help in most cases. I'll give you
an example from one of our applications. It's a memory hog by nature, since
it works on groupByKey and performs combinatorics on the Iterator, so it
maintains a few structures inside each task. It runs on MapReduce with half
the resources I am giving it on Spark, yet Spark keeps throwing OOM on a
pre-step which is a simple join! I saw every task run at PROCESS_LOCAL
locality, and the join still keeps failing because the container gets
killed, and the container gets killed due to OOM. We have had a case open
with Databricks/MapR on that for more than a month. Anyway, I don't want to
get distracted there. I can believe that switching to DataFrames and their
computing model can bring performance, but I was hoping that wouldn't be
your answer to every performance problem.

Let me ask this: if I decide to stick with RDDs, do I still have the
flexibility to choose which join implementation to use, and similar
underlying constructs, so I can best execute my jobs?
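To make the question concrete, here is the kind of control I mean: with RDDs I can take the default shuffle join from `join`, or hand-roll a map-side join by broadcasting the smaller side myself. A rough sketch with made-up data (the object and dataset names are purely illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddJoinChoice {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-join-choice").setMaster("local[*]"))

    val large = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val small = sc.parallelize(Seq((1, "x"), (3, "y")))

    // Option 1: the default shuffle join -- both sides get repartitioned by key.
    val shuffled = large.join(small)

    // Option 2: map-side join -- broadcast the small side and avoid
    // shuffling the large RDD at all.
    val lookup = sc.broadcast(small.collectAsMap())
    val mapSide = large.flatMap { case (k, v) =>
      lookup.value.get(k).map(w => (k, (v, w)))
    }

    // Both produce the same inner-join result.
    assert(shuffled.collect().toSet == mapSide.collect().toSet)
    sc.stop()
  }
}
```

That choice is explicit and mine to make; my question is whether an equivalent knob exists once everything is a DataFrame.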

I said error prone because you need to write column qualifiers instead of
referencing fields, i.e. caseObj("field1") instead of caseObj.field1.
Moreover, multiple tables having similar column names cause parsing issues;
and once you start writing constants for your column names, they just
become another schema to maintain inside your app. It feels like a thing of
the past. A query engine (distributed or not) is old school as I 'see' it :)
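The failure mode I mean is easy to show even outside Spark: string-keyed access moves a typo from compile time to runtime, and the name constants you add to guard against that are effectively a second schema. A plain-Scala analogy (not the actual DataFrame API; all names here are made up):

```scala
// Contrast typed field access with string-keyed lookup.
case class CaseObj(field1: String, field2: Int)

object ColumnAccess {
  // The "constants become another schema" problem: once columns are named
  // by string, you end up maintaining this alongside the case class itself.
  val Field1 = "field1"
  val Field2 = "field2"

  def main(args: Array[String]): Unit = {
    val caseObj = CaseObj("abc", 42)

    // Typed access: a typo like caseObj.feild1 would fail to compile.
    println(caseObj.field1)

    // String-keyed access (the caseObj("field1") style): the typo "feild1"
    // compiles fine and only surfaces at runtime.
    val byName: Map[String, Any] =
      Map(Field1 -> caseObj.field1, Field2 -> caseObj.field2)
    println(byName.get("feild1").isEmpty)  // true: caught only at runtime
  }
}
```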

Thanks for being patient.
Nirav





On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Nirav,
> I'm sure you read this?
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>
> There is a benchmark in the article to show that dataframe "can"
> outperform RDD implementation by 2 times. Of course, benchmarks can be
> "made". But from the code snippet you wrote, I "think" dataframe will
> choose between different join implementation based on the data statistics.
>
> I cannot comment on the beauty of it because "beauty is in the eye of the
> beholder" LOL
> Regarding the comment on error prone, can you say why you think it is the
> case? Relative to what other ways?
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
>
>> I don't understand why one thinks an RDD of case objects doesn't have
>> types (a schema). If Spark can convert an RDD to a DataFrame, that means it
>> understood the schema. So from that point on, why does one have to use SQL
>> features to do further processing? If all Spark needs for optimizations is
>> the schema, then what do these additional SQL features buy? If there is a way to
>> avoid the SQL features using DataFrames, I don't mind it. But it looks like I have to
>> convert all my existing transformations to things like
>> df1.join(df2, df1("abc") === df2("abc"), "left_outer") .. that's plain ugly
>> and error prone in my opinion.
>>
>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Michael,
>>>
>>> Is there a section in the spark documentation demonstrate how to
>>> serialize arbitrary objects in Dataframe? The last time I did was using
>>> some User Defined Type (copy from VectorUDT).
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <mich...@databricks.com
>>> > wrote:
>>>
>>>> A principal difference between RDDs and DataFrames/Datasets is that the
>>>>> latter have a schema associated to them. This means that they support only
>>>>> certain types (primitives, case classes and more) and that they are
>>>>> uniform, whereas RDDs can contain any serializable object and must not
>>>>> necessarily be uniform. These properties make it possible to generate very
>>>>> efficient serialization and other optimizations that cannot be achieved
>>>>> with plain RDDs.
>>>>>
>>>>
>>>> You can use Encoder.kryo() as well to serialize arbitrary objects, just
>>>> like with RDDs.
>>>>
>>>
>>>
>>
>>
>>
>>
>
>

