why do we want these (broadcast join and sort-merge join) in DataFrame but not in RDD?
they don't seem specific to structured data analysis to me.

On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra <rishi80.mis...@gmail.com> wrote:

> Got it..thnx Reynold..
> On 20 Sep 2015 07:08, "Reynold Xin" <r...@databricks.com> wrote:
>
>> The RDDs themselves are not materialized, but the implementations can
>> materialize.
>>
>> E.g. in cogroup (which is used by RDD.join), it materializes all the data
>> during grouping.
>>
>> In SQL/DataFrame join, depending on the join:
>>
>> 1. For broadcast join, only the smaller side is materialized in memory as
>> a hash table.
>>
>> 2. For sort-merge join, both sides are sorted & streamed through --
>> however, one of the sides needs to buffer all the rows having the same join
>> key in order to perform the join.
>>
>> On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra <
>> rishi80.mis...@gmail.com> wrote:
>>
>>> Hi Reynold,
>>> Can you please elaborate on this? I thought RDD also opens only an
>>> iterator. Does it get materialized for joins?
>>>
>>> Rishi
>>>
>>> On Saturday, September 19, 2015, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
>>>> streams.
>>>>
>>>> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> in scalding we join with the smaller side on the left, since the
>>>>> smaller side will get buffered while the bigger side streams through the
>>>>> join.
>>>>>
>>>>> looking at CoGroupedRDD i do not get the impression such a distinction
>>>>> is made. it seems both sides are put into a map that can spill to disk. is
>>>>> this correct?
>>>>>
>>>>> thanks
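
To make the materialization difference in the quoted thread concrete, here is a minimal sketch of the two join strategies on plain key/value iterables. It is plain Python, not Spark code, and it is not how Spark implements either join: the function names `broadcast_hash_join` and `sort_merge_join` are hypothetical, both do an inner join only, and the sort-merge version assumes its inputs are already sorted by key.

```python
from itertools import groupby
from operator import itemgetter


def broadcast_hash_join(small, large):
    """Inner join that materializes only `small`; `large` is streamed."""
    # Build an in-memory hash table from the smaller side.
    table = {}
    for key, value in small:
        table.setdefault(key, []).append(value)
    # Stream the larger side one row at a time and probe the table.
    for key, large_value in large:
        for small_value in table.get(key, ()):
            yield key, small_value, large_value


def sort_merge_join(left, right):
    """Inner join of two iterables that are ALREADY sorted by key."""
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    done = (None, None)  # sentinel: that side is exhausted
    left_key, left_rows = next(left_groups, done)
    right_key, right_rows = next(right_groups, done)
    while left_rows is not None and right_rows is not None:
        if left_key < right_key:
            left_key, left_rows = next(left_groups, done)
        elif left_key > right_key:
            right_key, right_rows = next(right_groups, done)
        else:
            # Only the right-side rows sharing the current key are buffered.
            buffered = [value for _, value in right_rows]
            for _, left_value in left_rows:
                for right_value in buffered:
                    yield left_key, left_value, right_value
            left_key, left_rows = next(left_groups, done)
            right_key, right_rows = next(right_groups, done)


if __name__ == "__main__":
    small = [(1, "a"), (2, "b")]
    large = [(2, "x"), (1, "y"), (2, "z"), (3, "w")]
    print(list(broadcast_hash_join(small, large)))
    print(list(sort_merge_join(sorted(small), sorted(large))))
```

In the first function the only state that grows with the input is the hash table for the smaller side. In the second, the only buffer is the list of right-side rows for the current join key. Neither sketch depends on a schema, only on a key extractor and (for sort-merge) an ordering on the keys.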