Hi,
is there any way to control how DataFrames are partitioned? I'm doing
lots of joins and am seeing very large shuffle reads and writes in the
Spark UI. With PairRDDs you can control how the data is partitioned
across nodes with partitionBy, but there is no such method on
DataFrames. Can I somehow partition the underlying RDD manually? I am
currently using the Python API.
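To make this concrete, here is a minimal sketch of what I can do today
with PairRDDs versus what I would like to do with DataFrames. The
column names ("user_id") and the partition count are made up for
illustration, and I'm on a 1.x release, so I'm using SQLContext:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="partitioning-question")
    sqlContext = SQLContext(sc)

    # With PairRDDs I can co-partition both sides before the join, so
    # the join itself is a narrow dependency and needs no extra shuffle:
    left = sc.parallelize([(1, "a"), (2, "b")]).partitionBy(8)
    right = sc.parallelize([(1, "x"), (2, "y")]).partitionBy(8)
    joined = left.join(right)

    # With DataFrames I don't see an equivalent hook, and this join
    # shows up in the UI as a full shuffle of both sides:
    df_left = sqlContext.createDataFrame([(1, "a"), (2, "b")],
                                         ["user_id", "v"])
    df_right = sqlContext.createDataFrame([(1, "x"), (2, "y")],
                                          ["user_id", "w"])
    df_joined = df_left.join(df_right, "user_id")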
Thanks!