Hi,

Wouldn't df.rdd.partitionBy() return a new RDD that I would then need to turn back into a DataFrame? Maybe like this: df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit awkward to me, though, and I'm not sure whether the resulting DataFrame will be aware of its partitioning.
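
To make that concrete, here is roughly what I am picturing, assuming I want to partition on a join column called "key" (the column name is just for illustration):

    # Rows are not (key, value) pairs, so partitionBy cannot be called on
    # df.rdd directly; key the rows first and strip the key off afterwards.
    pairs = df.rdd.keyBy(lambda row: row["key"])   # -> (key, Row) pairs
    partitioned = pairs.partitionBy(200)           # hash-partition into 200 parts
    rows = partitioned.values()                    # back to plain Rows
    df2 = rows.toDF(schema=df.schema)

My suspicion is that the partitioner is lost as soon as the RDD is turned back into a DataFrame, so the optimizer might still insert a shuffle for subsequent joins.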

On 2015-05-22 12:55, ayan guha wrote:
DataFrame is an abstraction over an RDD, so you should be able to do
df.rdd.partitionBy. However, as far as I know, equijoins already optimize
partitioning. You may want to look at the explain plans more carefully and
materialise interim joins.
 On 22 May 2015 19:03, "Karlson" <ksonsp...@siberie.de> wrote:

Hi,

is there any way to control how DataFrames are partitioned? I'm doing lots of joins and am seeing very large shuffle reads and writes in the Spark UI. With PairRDDs you can control how the data is partitioned across nodes with partitionBy, but there is no such method on DataFrames. Can I somehow partition the underlying RDD manually? I am currently using the Python API.
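
For reference, this is the kind of control I mean on the plain RDD side (sc being the SparkContext; the data is just a toy example):

    left = sc.parallelize([(1, "a"), (2, "b")]).partitionBy(8)
    right = sc.parallelize([(1, "x"), (2, "y")]).partitionBy(8)
    # both sides now share the same partitioner, so the join
    # shouldn't need to re-shuffle either of them
    joined = left.join(right)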

Thanks!
