You can use coalesce(1) or repartition on B, but it would be better to cache A so that it is held in memory and stays available on the executors, because it contains only one row.
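A rough PySpark sketch of what I mean, in case it helps. The function name, the `key` join column and the `num_partitions` parameter are made up for illustration, and I am assuming both inputs are ordinary DataFrames that share that column. Note that Spark's planner still picks the join strategy, so this alone does not pin the work to the executors holding A.

```python
def join_b_on_cached_a(a_df, b_df, key, num_partitions=None):
    """Cache A, reshape B's partitioning, then join B on A.

    Assumptions (not from the original question): a_df and b_df are
    pyspark.sql.DataFrame objects, and `key` names a column present in both.
    """
    # cache() only marks A for in-memory storage; it is materialized
    # the first time an action runs on it.
    a_cached = a_df.cache()

    if num_partitions is None:
        # coalesce(1) collapses B into a single partition without a full shuffle.
        b_prepared = b_df.coalesce(1)
    else:
        # repartition() shuffles B into num_partitions partitions, hashed on the key.
        b_prepared = b_df.repartition(num_partitions, key)

    return b_prepared.join(a_cached, on=key, how="inner")
```

You would call it as `join_b_on_cached_a(A, B, "id")` for the coalesce(1) variant, or pass `num_partitions` to repartition B on the key instead.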

On Sat, Jun 29, 2019 at 4:10 PM jelmer <jkupe...@gmail.com> wrote:

> I have 2 dataframes,
>
> Dataframe A which contains 1 element per partition that is gigabytes big
> (an index)
>
> Dataframe B which is made up out of millions of small rows.
>
> I want to join B on A but i want all the work to be done on the executors
> holding the partitions of dataframe A
>
> Is there a way to accomplish this without putting dataframe B in a
> broadcast variable or doing a broadcast join ?
>

--
Regards,
Arbab Khalil
Software Design Engineer