The closest thing I can think of is writing both dataframes out using
buckets. Hive uses this technique for join optimisation: when both datasets
are bucketed on the join key, the corresponding buckets are read by the same
mapper, which allows a map-side join without shuffling either side.
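
For illustration, a minimal sketch of how that could look in Spark. The table
names, the join column ("id"), and the bucket count (64) are assumptions on my
part, not something from your setup:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("bucketed-join-sketch")
      .enableHiveSupport() // optional: only needed if a Hive metastore backs the catalog
      .getOrCreate()

    val dfA = spark.table("source_a") // the large, index-like dataframe (assumed name)
    val dfB = spark.table("source_b") // the millions of small rows (assumed name)

    // Write both sides bucketed (and sorted) by the join key into the same
    // number of buckets, so matching keys land in corresponding buckets.
    dfA.write.bucketBy(64, "id").sortBy("id").saveAsTable("bucketed_a")
    dfB.write.bucketBy(64, "id").sortBy("id").saveAsTable("bucketed_b")

    // Joining the bucketed tables lets Spark plan a sort-merge join without
    // an exchange on either input; explain() should show no shuffle steps.
    val joined = spark.table("bucketed_a").join(spark.table("bucketed_b"), "id")
    joined.explain()

Whether this keeps all the work on the executors holding dataframe A depends
on where the bucket files end up, but it does avoid broadcasting B.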

On Sat., 29 Jun. 2019, 9:10 pm jelmer, <jkupe...@gmail.com> wrote:

> I have 2 dataframes:
>
> Dataframe A, which contains one element per partition, each gigabytes in
> size (an index).
>
> Dataframe B, which is made up of millions of small rows.
>
> I want to join B on A, but I want all the work to be done on the executors
> holding the partitions of dataframe A.
>
> Is there a way to accomplish this without putting dataframe B in a
> broadcast variable or doing a broadcast join?
>
>