DataFrame is an abstraction over an RDD, so you should be able to do
df.rdd.partitionBy. However, as far as I know, equijoins already optimise
partitioning. You may want to look at the explain plans more carefully and
materialise interim joins.
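
For example, something along these lines (an untested sketch against the
PySpark 1.3/1.4 API; the DataFrames, the join column "id", and the
partition count are just placeholders for your own data):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="partitioning-sketch")
    sqlContext = SQLContext(sc)

    df1 = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "x"])
    df2 = sqlContext.createDataFrame([(1, "c"), (2, "d")], ["id", "y"])

    # First check what the optimiser actually does with the join
    # before trying to tune the partitioning by hand.
    df1.join(df2, df1.id == df2.id).explain(extended=True)

    # Dropping down to the RDD: partitionBy needs a pair RDD, so key
    # the rows by the join column first.
    pairs = df1.rdd.map(lambda row: (row.id, row))
    partitioned = pairs.partitionBy(8)  # 8 partitions is arbitrary here

    # Rebuild a DataFrame from the partitioned rows if you need one.
    df1p = sqlContext.createDataFrame(partitioned.values(), df1.schema)

Note that going through df.rdd and back loses the DataFrame's notion of
the partitioning, so check the explain output again afterwards to see
whether the shuffle actually went away.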
 On 22 May 2015 19:03, "Karlson" <ksonsp...@siberie.de> wrote:

> Hi,
>
> is there any way to control how DataFrames are partitioned? I'm doing lots
> of joins and am seeing very large shuffle reads and writes in the Spark UI.
> With PairRDDs you can control how the data is partitioned across nodes with
> partitionBy. There is no such method on DataFrames, however. Can I somehow
> partition the underlying RDD manually? I am currently using the Python
> API.
>
> Thanks!
>