First of all, even if the underlying dataset is partitioned as expected,
the shuffle can't be avoided, because Spark SQL knows nothing about the
underlying data distribution and must plan the exchange anyway. However,
pre-partitioning does reduce network IO, since rows sharing the same key
are already co-located and most of the shuffled data stays on its node.
You can prepare your data like this (say CustomerCode is a string
field with ordinal 1):
import org.apache.spark.HashPartitioner

val schemaRdd = sql(...)
val schema = schemaRdd.schema
val prepared = schemaRdd
  .keyBy(_.getString(1))               // key each row by CustomerCode (ordinal 1)
  .partitionBy(new HashPartitioner(n)) // co-locate rows that share a key
  .values
  .applySchema(schema)                 // reattach the original schema
n should be equal to spark.sql.shuffle.partitions.
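For reference, here is a minimal end-to-end sketch of the same recipe
against the Spark 1.2-era API. The projection, the PreparedOrders table
name, and n = 200 are illustrative assumptions, not values from this thread:

import org.apache.spark.SparkContext._   // pair-RDD implicits (partitionBy)
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext named sc.
val sqlContext = new SQLContext(sc)
import sqlContext._

// Illustrative partition count; keep it equal to spark.sql.shuffle.partitions.
val n = 200
sqlContext.setConf("spark.sql.shuffle.partitions", n.toString)

val orders = sql("SELECT CustomerCode, Cost FROM Orders")
val schema = orders.schema

// CustomerCode has ordinal 0 in this projection. Re-key, hash-partition,
// drop the keys, then reattach the schema to get a SchemaRDD back.
val rows = orders.keyBy(_.getString(0)).partitionBy(new HashPartitioner(n)).values
val prepared = sqlContext.applySchema(rows, schema)
prepared.registerTempTable("PreparedOrders")

// The GROUP BY still plans a shuffle, but most rows already sit on the node
// that will aggregate them, so the exchange is mostly local.
sql("SELECT CustomerCode, SUM(Cost) FROM PreparedOrders GROUP BY CustomerCode")
  .collect()
  .foreach(println)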
Cheng
On 1/19/15 7:44 AM, Mick Davies wrote:
Is it possible to use a HashPartitioner or something similar to distribute a
SchemaRDD's data by the hash of a particular column or set of columns?
Having done this, I would then hope that GROUP BY could avoid a shuffle.
E.g. set up a HashPartitioner on the CustomerCode field so that
SELECT CustomerCode, SUM(Cost)
FROM Orders
GROUP BY CustomerCode
would not need to shuffle.
Cheers
Mick