Hi,
It seems as if when doing broadcast join, the entire dataframe is resent even
if part of it has already been broadcasted.
Consider the following case:
val df1 = ???
val df2 = ???
val df3 = ???
df3.join(broadcast(df1), on=cond, "left_outer")
followed by
df4.join(broadcast(df1.union(df2), on=cond, "left_outer")
I would expect the second broadcast to only broadcast the difference. However,
if I do explain(true) I see the entire union is broadcast.
My use case is that I have a series of dataframes on which I need to do some
enrichment, joining them with a small dataframe. The small dataframe gets
additional information (as the result of each aggregation).
Is there an efficient way of doing this?
Thanks,
Assaf.