Alright, that doesn't seem to have made it into the Python API yet.
On 2015-05-22 15:12, Silvio Fiorito wrote:
This was added in 1.4.0:
https://github.com/apache/spark/pull/5762
On 5/22/15, 8:48 AM, "Karlson" <ksonsp...@siberie.de> wrote:
Hi,
wouldn't df.rdd.partitionBy() return a new RDD that I would then need to make into a DataFrame again? Maybe like this: df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit weird to me, though, and I'm not sure whether the DataFrame will be aware of its partitioning.
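Concretely, I imagine something like this (a rough sketch; 'key_col' and the partition count are placeholders, and I don't know whether the resulting DataFrame keeps track of the partitioning):

    # Key the rows by the join column, hash-partition, then drop the keys again.
    rdd = (df.rdd
             .keyBy(lambda row: row['key_col'])   # partitionBy needs (key, value) pairs
             .partitionBy(16)                     # hash-partition into 16 partitions
             .values())

    # Rebuild a DataFrame with the original schema (assumes a SQLContext exists).
    df2 = rdd.toDF(schema=df.schema)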
On 2015-05-22 12:55, ayan guha wrote:
DataFrame is an abstraction over an RDD, so you should be able to do df.rdd.partitionBy. However, as far as I know, equijoins are already optimized for partitioning. You may want to look at the explain plans more carefully and materialise the interim joins.
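For example (a sketch; the DataFrame and column names are made up):

    # Inspect the physical plan to see where shuffles happen.
    joined = df1.join(df2, df1['id'] == df2['id'])
    joined.explain()

    # Materialise an interim join so later joins reuse it instead of reshuffling.
    joined.cache()
    joined.count()  # force evaluation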
On 22 May 2015 19:03, "Karlson" <ksonsp...@siberie.de> wrote:
Hi,
is there any way to control how DataFrames are partitioned? I'm doing lots of joins and am seeing very large shuffle reads and writes in the Spark UI.

With PairRDDs you can control how the data is partitioned across nodes with partitionBy. There is no such method on DataFrames, however. Can I somehow partition the underlying RDD manually? I am currently using the Python API.
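For illustration, this is the kind of control I mean on the RDD level (a rough sketch; the keys and partition count are arbitrary):

    from pyspark import SparkContext

    sc = SparkContext(appName="partitioning-example")

    # Hash-partition two pair RDDs the same way so the join avoids a full shuffle.
    left = sc.parallelize([(1, "a"), (2, "b"), (3, "c")]).partitionBy(8)
    right = sc.parallelize([(1, "x"), (3, "y")]).partitionBy(8)

    # Records with the same key now sit in the same partition on both sides.
    joined = left.join(right)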
Thanks!