why dont we want these (broadcast join and sort-merge join) in DataFrame
but not in RDD?

they dont seem specific to structured data analysis to me.

On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra <rishi80.mis...@gmail.com>
wrote:

> Got it..thnx Reynold..
> On 20 Sep 2015 07:08, "Reynold Xin" <r...@databricks.com> wrote:
>
>> The RDDs themselves are not materialized, but the implementations can
>> materialize.
>>
>> E.g. in cogroup (which is used by RDD.join), it materializes all the data
>> during grouping.
>>
>> In SQL/DataFrame join, depending on the join:
>>
>> 1. For broadcast join, only the smaller side is materialized in memory as
>> a hash table.
>>
>> 2. For sort-merge join, both sides are sorted & streamed through --
>> however, one of the sides need to buffer all the rows having the same join
>> key in order to perform the join.
>>
>>
>>
>> On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra <
>> rishi80.mis...@gmail.com> wrote:
>>
>>> Hi Reynold,
>>> Can you please elaborate on this. I thought RDD also opens only an
>>> iterator. Does it get materialized for joins?
>>>
>>> Rishi
>>>
>>> On Saturday, September 19, 2015, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
>>>> streams.
>>>>
>>>>
>>>> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> in scalding we join with the smaller side on the left, since the
>>>>> smaller side will get buffered while the bigger side streams through the
>>>>> join.
>>>>>
>>>>> looking at CoGroupedRDD i do not get the impression such a distiction
>>>>> is made. it seems both sided are put into a map that can spill to disk. is
>>>>> this correct?
>>>>>
>>>>> thanks
>>>>>
>>>>
>>>>
>>

Reply via email to