I'm just trying to make sure that something I know in advance (the joins will 
always have an equality test on one specific field) is used to optimize the 
partitioning, so that the joins only use local data.
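
Concretely, the kind of thing I have in mind is sketched below, using the 
shell's sc. The file names, field positions, and partition count are all 
made up:

    import org.apache.spark.HashPartitioner

    // One HashPartitioner shared by every RDD that will take part in the joins.
    val byKey = new HashPartitioner(64)

    // Key each dataset by the common field and partition with the same
    // partitioner. partitionBy shuffles once here, up front.
    val docs = sc.textFile("docs.tsv")
      .map(_.split("\t"))
      .map(f => (f(0), f(1)))          // (joinKey, payload)
      .partitionBy(byKey)
      .cache()

    val annotations = sc.textFile("annotations.tsv")
      .map(_.split("\t"))
      .map(f => (f(0), f(1)))
      .partitionBy(byKey)
      .cache()

    // Because both sides share the same partitioner, the join is a narrow
    // dependency: each output partition reads only co-located partitions,
    // so no additional shuffle happens at join time.
    val joined = docs.join(annotations)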

Thanks for the info.

Ron


From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, May 08, 2015 3:15 PM
To: Daniel, Ronald (ELS-SDG)
Cc: user@spark.apache.org
Subject: Re: Hash Partitioning and Dataframes

What are you trying to accomplish?  Internally Spark SQL will add Exchange 
operators to make sure that data is partitioned correctly for joins and 
aggregations.  If you are going to do other RDD operations on the result of 
dataframe operations and you need to manually control the partitioning, call 
df.rdd and partition as you normally would.
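
For example, something roughly like this, where the table name, column names, 
and partition count are just placeholders:

    import org.apache.spark.HashPartitioner

    val df = sqlContext.table("documents")   // any DataFrame

    // Put the join column first, drop to the RDD, key by it, then partition
    // as you normally would with RDDs.
    val keyed = df.select("doc_id", "text").rdd
      .map(row => (row.getString(0), row.getString(1)))
      .partitionBy(new HashPartitioner(64))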

On Fri, May 8, 2015 at 2:47 PM, Daniel, Ronald (ELS-SDG) 
<r.dan...@elsevier.com> wrote:
Hi,

How can I ensure that a batch of DataFrames I make are all partitioned based on 
the value of one column common to them all?
For RDDs I would use partitionBy with a HashPartitioner, but I don't see an 
equivalent in the DataFrame API.
If I partition the RDDs that way, then do a toDF(), will the partitioning be 
preserved?
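
For concreteness, the pattern I'm asking about is something like the sketch 
below (names and partition count made up, using the shell's sc / sqlContext):

    import org.apache.spark.HashPartitioner
    import sqlContext.implicits._

    // Pair RDD keyed on the common column, hash-partitioned up front.
    val partitioned = sc.textFile("docs.tsv")
      .map(_.split("\t"))
      .map(f => (f(0), f(1)))
      .partitionBy(new HashPartitioner(64))

    // Then convert to a DataFrame -- is that partitioning still used downstream?
    val df = partitioned.toDF("doc_id", "text")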

Thanks,
Ron
