Since the data is spread across multiple JVMs, only one of them can be the
driver. So I can parallelize the data from one of the JVMs, but I don't have
a way to do the same for the others. Or am I missing something?
On Tue, Feb 28, 2017 at 3:53 PM, ayan guha wrote:
How about parallelize and then union all of them to one data frame?
On Wed, 1 Mar 2017 at 3:07 am, Sean Owen wrote:
Broadcasts let you send one copy of read-only data to each executor. That's
not the same as a DataFrame, and its nature means it doesn't make sense
to think of them as not distributed. But consider things like broadcast
hash joins which may be what you are looking for if you really mean to join
o
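
To illustrate the broadcast hash join idea mentioned above, here is a minimal plain-Python sketch of the mechanism. It is not Spark's implementation and does not use Spark at all; the function name `broadcast_hash_join` and the sample data are hypothetical. It only shows the general shape: the small side is turned into a hash table once, and every partition of the large side probes that same read-only table locally.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each partition of the large side against a small table.

    large_partitions: list of partitions, each a list of (key, value) pairs.
    small_rows: list of (key, value) pairs, assumed small enough to copy.
    """
    # Build the hash table once from the small side. In a cluster this is
    # the read-only copy that would be shipped to every executor.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    joined = []
    # Each partition probes the same lookup table independently, so the
    # large side never has to be moved or re-partitioned.
    for partition in large_partitions:
        part_out = []
        for key, left_value in partition:
            for right_value in lookup.get(key, []):
                part_out.append((key, left_value, right_value))
        joined.append(part_out)
    return joined


# Hypothetical data: orders split over two partitions, plus a small lookup table.
orders = [[(1, "order-a"), (2, "order-b")], [(1, "order-c"), (3, "order-d")]]
countries = [(1, "AU"), (2, "US")]
result = broadcast_hash_join(orders, countries)
```

In Spark itself, the usual way to request this kind of join is the `broadcast()` hint from `org.apache.spark.sql.functions` (or `pyspark.sql.functions`) applied to the small DataFrame; whether that fits the setup described in this thread depends on how the data is loaded in the first place.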