The closest thing I can think of here is writing both dataframes out using buckets. Hive uses this technique for join optimisation: matching buckets of both datasets are read by the same mapper, which enables map-side joins without a shuffle or a broadcast. Spark's DataFrameWriter supports the same idea via bucketBy; a rough sketch follows.
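
A minimal sketch of what I mean (the paths, table names, join key "key" and bucket count 200 are just placeholders, adjust to your data):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-join-sketch")
  .enableHiveSupport()          // bucketed tables need a catalog to live in
  .getOrCreate()

// dfA: the large index-like dataframe; dfB: the millions of small rows
val dfA = spark.read.parquet("/path/to/dataframeA")
val dfB = spark.read.parquet("/path/to/dataframeB")

// Write both sides bucketed (and sorted) on the same key with the same
// number of buckets, so bucket N of A lines up with bucket N of B.
dfA.write.bucketBy(200, "key").sortBy("key").saveAsTable("bucketed_a")
dfB.write.bucketBy(200, "key").sortBy("key").saveAsTable("bucketed_b")

// Joining the bucketed tables on the bucket key lets the sort-merge join
// read matching buckets together and skip the shuffle. Disabling auto
// broadcast keeps Spark from broadcasting B instead.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val joined = spark.table("bucketed_a")
  .join(spark.table("bucketed_b"), "key")

It does mean paying the cost of writing both datasets out bucketed up front, but after that the join work stays on the executors reading the buckets.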
On Sat., 29 Jun. 2019, 9:10 pm jelmer, <jkupe...@gmail.com> wrote:

> I have 2 dataframes:
>
> Dataframe A, which contains 1 element per partition that is gigabytes big
> (an index).
>
> Dataframe B, which is made up of millions of small rows.
>
> I want to join B on A, but I want all the work to be done on the executors
> holding the partitions of dataframe A.
>
> Is there a way to accomplish this without putting dataframe B in a
> broadcast variable or doing a broadcast join?