Hi Nirav,

I don't know why those optimizations are not implemented for RDDs. It is
either a political choice or a practical one (backward compatibility might be
difficult if they need to introduce these types of optimizations into the RDD
API). I think this is the reason Spark now has Dataset. My understanding is
that Dataset is the new RDD.
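For example, something like this (an untested sketch against the Spark 1.6
Dataset API; the Order case class, the data, and an existing SparkContext
sc are just placeholders of mine):

    import org.apache.spark.sql.SQLContext

    case class Order(customer: String, amount: Double)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // A Dataset is typed like an RDD of case objects...
    val orders = sqlContext.createDataset(Seq(
      Order("a", 10.0), Order("b", 20.0)))

    // ...so transformations take ordinary Scala functions, while the
    // optimizer still sees the schema through the encoder.
    val big = orders.filter(o => o.amount > 15.0).map(_.customer)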
Best Regards,

Jerry

Sent from my iPhone

> On 3 Feb, 2016, at 12:26 am, Koert Kuipers <ko...@tresata.com> wrote:
>
> With respect to joins, unfortunately not all implementations are available.
> For example, I would like to use joins where one side is streaming (and the
> other cached). This seems to be available for DataFrame but not for RDD.
>
>> On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>> Hi Jerry,
>>
>> Yes, I read that benchmark, and it doesn't help in most cases. I'll give
>> you an example from one of our applications. It's a memory hogger by nature
>> since it works on groupByKey and performs combinatorics on the Iterator, so
>> it maintains a few structures inside each task. It runs on MapReduce with
>> half the resources I am giving it on Spark, and Spark keeps throwing OOM on
>> a pre-step which is a simple join! I saw every task run at PROCESS_LOCAL
>> locality, yet the join keeps failing because the container is killed, and
>> the container is killed due to OOM. We have had a case open with
>> Databricks/MapR on that for more than a month. Anyway, I don't want to
>> digress. I can believe that changing to DataFrame and its computing model
>> can bring performance gains, but I was hoping that wouldn't be your answer
>> to every performance problem.
>>
>> Let me ask this: if I decide to stick with RDDs, do I still have the
>> flexibility to choose which join implementation to use, and similar
>> underlying constructs to best execute my jobs?
>>
>> I said error prone because you need to write column qualifiers instead of
>> referencing fields, i.e. 'caseObj("field1")' instead of 'caseObj.field1';
>> moreover, multiple tables having similar column names causes parsing
>> issues; and when you start writing constants for your columns it just
>> becomes another schema to maintain inside your app. It feels like a thing
>> of the past. A query engine (distributed or not) is old school as I 'see'
>> it :)
>>
>> Thanks for being patient.
>> Nirav
>>
>>> On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>> Hi Nirav,
>>>
>>> I'm sure you read this?
>>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>>
>>> There is a benchmark in the article showing that DataFrame "can"
>>> outperform an RDD implementation by 2 times. Of course, benchmarks can be
>>> "made". But from the code snippet you wrote, I "think" DataFrame will
>>> choose between different join implementations based on the data
>>> statistics.
>>>
>>> I cannot comment on the beauty of it because "beauty is in the eye of the
>>> beholder" LOL
>>>
>>> Regarding the comment on error prone, can you say why you think that is
>>> the case? Relative to what other ways?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>> I don't understand why one thinks an RDD of case objects doesn't have
>>>> types (a schema). If Spark can convert an RDD to a DataFrame, that means
>>>> it understood the schema. So then, from that point, why does one have to
>>>> use SQL features to do further processing? If all Spark needs for its
>>>> optimizations is a schema, then what do these additional SQL features
>>>> buy? If there is a way to avoid the SQL features while using DataFrame,
>>>> I don't mind it. But it looks like I have to convert all my existing
>>>> transformations to things like
>>>> df1.join(df2, df1("abc") === df2("abc"), "left_outer") ... that's plain
>>>> ugly and error prone in my opinion.
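>>>> To make it concrete, here is roughly the before and after I mean (a
>>>> simplified sketch; the case classes, the sales/accounts RDDs, and their
>>>> DataFrame equivalents are made-up names):
>>>>
>>>>     case class Sale(acctId: Int, amount: Double)
>>>>     case class Account(acctId: Int, name: String)
>>>>
>>>>     // RDD style: fields are referenced by name and checked at
>>>>     // compile time
>>>>     val joinedRdd = sales.keyBy(_.acctId).join(accounts.keyBy(_.acctId))
>>>>
>>>>     // DataFrame style: columns are strings, so a typo in "acctId" or
>>>>     // "left_outer" only shows up at runtime
>>>>     val joinedDf = salesDf.join(accountsDf,
>>>>       salesDf("acctId") === accountsDf("acctId"), "left_outer")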
>>>>
>>>>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>>> Hi Michael,
>>>>>
>>>>> Is there a section in the Spark documentation that demonstrates how to
>>>>> serialize arbitrary objects in a DataFrame? The last time I did it, I
>>>>> used a User Defined Type (copied from VectorUDT).
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Jerry
>>>>>
>>>>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <mich...@databricks.com>
>>>>> wrote:
>>>>>>> A principal difference between RDDs and DataFrames/Datasets is that
>>>>>>> the latter have a schema associated with them. This means that they
>>>>>>> support only certain types (primitives, case classes and more) and
>>>>>>> that they are uniform, whereas RDDs can contain any serializable
>>>>>>> object and need not be uniform. These properties make it possible to
>>>>>>> generate very efficient serialization and other optimizations that
>>>>>>> cannot be achieved with plain RDDs.
>>>>>>
>>>>>> You can use Encoders.kryo() as well to serialize arbitrary objects,
>>>>>> just like with RDDs.
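>>>>>>
>>>>>> Something like this (an untested sketch against the Spark 1.6 API;
>>>>>> MyClass and sqlContext are placeholders):
>>>>>>
>>>>>>     import org.apache.spark.sql.{Encoder, Encoders}
>>>>>>
>>>>>>     class MyClass(val payload: Array[Byte]) extends Serializable
>>>>>>
>>>>>>     // Kryo-based encoder for a class that has no schema-based encoder
>>>>>>     implicit val myClassEncoder: Encoder[MyClass] =
>>>>>>       Encoders.kryo[MyClass]
>>>>>>
>>>>>>     val ds = sqlContext.createDataset(
>>>>>>       Seq(new MyClass(Array[Byte](1, 2, 3))))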