Hi,

Wouldn't df.rdd.partitionBy() return a new RDD that I would then need to turn back into a DataFrame? Maybe like this: df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit awkward to me, though, and I'm not sure whether the resulting DataFrame will be aware of its partitioning.
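
To make that concrete, here is roughly what I am picturing, assuming I want to partition on a join column called "key" (the column name is just for illustration):

    # Rows are not (key, value) pairs, so partitionBy cannot be called on
    # df.rdd directly; key the rows first and strip the key off afterwards.
    pairs = df.rdd.keyBy(lambda row: row["key"])   # -> (key, Row) pairs
    partitioned = pairs.partitionBy(200)           # hash-partition into 200 parts
    rows = partitioned.values()                    # back to plain Rows
    df2 = rows.toDF(schema=df.schema)

My suspicion is that the partitioner is lost as soon as the RDD is turned back into a DataFrame, so the optimizer might still insert a shuffle for subsequent joins.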

On 2015-05-22 12:55, ayan guha wrote:
DataFrame is an abstraction over an RDD, so you should be able to do
df.rdd.partitionBy. However, as far as I know, equijoins already optimize
partitioning. You may want to look at the explain plans more carefully and
materialise interim joins.
 On 22 May 2015 19:03, "Karlson" <ksonsp...@siberie.de> wrote:

Hi,

is there any way to control how DataFrames are partitioned? I'm doing lots of joins and am seeing very large shuffle reads and writes in the Spark UI. With PairRDDs you can control how the data is partitioned across nodes with partitionBy, but there is no such method on DataFrames. Can I somehow partition the underlying RDD manually? I am currently using the Python API.
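
For reference, this is the kind of control I mean on the plain RDD side (sc being the SparkContext; the data is just a toy example):

    left = sc.parallelize([(1, "a"), (2, "b")]).partitionBy(8)
    right = sc.parallelize([(1, "x"), (2, "y")]).partitionBy(8)
    # both sides now share the same partitioner, so the join
    # shouldn't need to re-shuffle either of them
    joined = left.join(right)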

Thanks!
