First of all, even if the underlying dataset is partitioned as expected, a shuffle can't be avoided, because Spark SQL knows nothing about the underlying data distribution. However, pre-partitioning the data this way does reduce network IO.

You can prepare your data like this (say CustomerCode is a string field at ordinal 1):

import org.apache.spark.HashPartitioner

val schemaRdd = sql(...)
val schema = schemaRdd.schema
// Key each row by CustomerCode (field index 1), hash-partition, then drop the keys
val partitioned = schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values
// applySchema lives on SQLContext, so reattach the schema there
val prepared = sqlContext.applySchema(partitioned, schema)
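
For example (just a sketch, reusing the table and query from the question quoted below), you can then register the prepared data and run the aggregation against it:

prepared.registerTempTable("Orders")
val totals = sql("SELECT CustomerCode, SUM(Cost) FROM Orders GROUP BY CustomerCode")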

n should be equal to spark.sql.shuffle.partitions.
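
If you don't want to hard-code n, you can read the setting from the same sqlContext as above (200 is the default):

val n = sqlContext.getConf("spark.sql.shuffle.partitions", "200").toInt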

Cheng

On 1/19/15 7:44 AM, Mick Davies wrote:

Is it possible to use a HashPartitioner or something similar to distribute a
SchemaRDD's data by the hash of a particular column or set of columns?

Having done this, I would then hope that GROUP BY could avoid a shuffle.

E.g. set up a HashPartitioner on the CustomerCode field so that

SELECT CustomerCode, SUM(Cost)
FROM Orders
GROUP BY CustomerCode

would not need to shuffle.

Cheers
Mick

