Hi Jestin,
Spark is smart about how it does joins. In this case, if df2 is sufficiently
small (below spark.sql.autoBroadcastJoinThreshold, 10 MB by default) it will
do a broadcast join: rather than shuffling both df1 and df2 for the join, it
broadcasts df2 to every worker and joins locally. Looks like you may already
have known that though based on using the
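
In case it's useful, here's a minimal PySpark sketch of forcing the
broadcast explicitly (toy data; the column names are placeholders, not
from your job):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    # Toy stand-ins for the real tables.
    df1 = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "v"])
    df2 = spark.createDataFrame([("a", "x"), ("b", "y")], ["key", "meta"])

    # broadcast() marks df2 for a broadcast join regardless of
    # spark.sql.autoBroadcastJoinThreshold, so it works even for a
    # 250 MB table that would not be broadcast automatically.
    joined = df1.groupBy("key").count().join(broadcast(df2), "key")
    joined.explain()  # physical plan should show a broadcast join

The explicit hint is just a way to opt in when the small table is over
the automatic threshold.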
Hello,
Right now I'm using DataFrames to perform df1.groupBy(key).count() on one
DataFrame, df1, and then join the result with another, df2.
The first, df1, is very large (many gigabytes) compared to df2 (250 MB).
I'm running this on a cluster of 5 nodes, each with 16 cores and 90 GB of
RAM.
It is taking me about
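
For reference, the job boils down to something like this sketch (the
column name "key" and the input paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("groupby-count-join").getOrCreate()

    # Placeholder inputs; the real df1 is many GB, df2 about 250 MB.
    df1 = spark.read.parquet("hdfs:///path/to/df1")
    df2 = spark.read.parquet("hdfs:///path/to/df2")

    counts = df1.groupBy("key").count()  # aggregate the large table
    result = counts.join(df2, "key")     # then join with the small one
    result.explain()  # shows whether Spark chose a shuffle or broadcast join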