A DataFrame is an abstraction over an RDD, so you should be able to drop down to df.rdd and use partitionBy there. However, as far as I know, Spark SQL already optimises the partitioning for equi-joins. You may want to look at the explain plans more carefully and materialise the interim joins.
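Something along these lines, roughly (a minimal sketch; the toy data, the user_id join key, and the partition count are made up for illustration, and note that partitionBy only exists on pair RDDs, so the Rows have to be keyed on the join column first and rebuilt into a DataFrame afterwards):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="df-partition-sketch")
    sqlContext = SQLContext(sc)

    # Hypothetical stand-in for your real DataFrame.
    df = sqlContext.createDataFrame([(1, "a"), (2, "b"), (1, "c")],
                                    ["user_id", "value"])

    # Key each Row by the join column, hash-partition on that key,
    # then drop the keys and rebuild a DataFrame with the same schema.
    pair_rdd = df.rdd.map(lambda row: (row.user_id, row))
    repartitioned = pair_rdd.partitionBy(8)
    df2 = sqlContext.createDataFrame(repartitioned.values(), df.schema)

    # Inspect the physical plan before running the joins.
    df2.explain()

Bear in mind that the DataFrame planner does not know about the RDD's partitioner, so this mainly helps if the shuffle itself is the bottleneck; comparing the explain plans before and after is the way to tell.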
On 22 May 2015 19:03, "Karlson" <ksonsp...@siberie.de> wrote:

> Hi,
>
> is there any way to control how DataFrames are partitioned? I'm doing
> lots of joins and am seeing very large shuffle reads and writes in the
> Spark UI. With PairRDDs you can control how the data is partitioned
> across nodes with partitionBy. There is no such method on DataFrames
> however. Can I somehow partition the underlying RDD manually? I am
> currently using the Python API.
>
> Thanks!