Alright, that doesn't seem to have made it into the Python API yet.
On 2015-05-22 15:12, Silvio Fiorito wrote:
This was added in 1.4.0:
https://github.com/apache/spark/pull/5762
On 5/22/15, 8:48 AM, "Karlson" <ksonsp...@siberie.de> wrote:
Hi,
wouldn't df.rdd.partitionBy() return a new RDD that I would then need to make into a DataFrame again? Maybe like this: df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit weird to me, though, and I'm not sure whether the DataFrame will be aware of its partitioning.
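Concretely, I imagine something like this (a rough sketch; 'key_col' and the partition count are placeholders, and I don't know whether the resulting DataFrame keeps track of the partitioning):

    # Key the rows by the join column, hash-partition, then drop the keys again.
    rdd = (df.rdd
             .keyBy(lambda row: row['key_col'])   # partitionBy needs (key, value) pairs
             .partitionBy(16)                     # hash-partition into 16 partitions
             .values())

    # Rebuild a DataFrame with the original schema (assumes a SQLContext exists).
    df2 = rdd.toDF(schema=df.schema)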
On 2015-05-22 12:55, ayan guha wrote:
DataFrame is an abstraction over an RDD, so you should be able to do df.rdd.partitionBy. However, as far as I know, equijoins are already optimized for partitioning. You may want to look at the explain plans more carefully and materialise the interim joins.
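For example (a sketch; the DataFrame and column names are made up):

    # Inspect the physical plan to see where shuffles happen.
    joined = df1.join(df2, df1['id'] == df2['id'])
    joined.explain()

    # Materialise an interim join so later joins reuse it instead of reshuffling.
    joined.cache()
    joined.count()  # force evaluation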
On 22 May 2015 19:03, "Karlson" <ksonsp...@siberie.de> wrote:
Hi,
is there any way to control how DataFrames are partitioned? I'm doing lots of joins and am seeing very large shuffle reads and writes in the Spark UI.

With PairRDDs you can control how the data is partitioned across nodes with partitionBy. There is no such method on DataFrames, however. Can I somehow partition the underlying RDD manually? I am currently using the Python API.
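For illustration, this is the kind of control I mean on the RDD level (a rough sketch; the keys and partition count are arbitrary):

    from pyspark import SparkContext

    sc = SparkContext(appName="partitioning-example")

    # Hash-partition two pair RDDs the same way so the join avoids a full shuffle.
    left = sc.parallelize([(1, "a"), (2, "b"), (3, "c")]).partitionBy(8)
    right = sc.parallelize([(1, "x"), (3, "y")]).partitionBy(8)

    # Records with the same key now sit in the same partition on both sides.
    joined = left.join(right)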
Thanks!