Hi,

If I have a DataFrame and write it out via partitionBy("id"), then
presumably, when I load the DataFrame back in and do a groupBy("id"), a
shuffle shouldn't be necessary, right? Effectively, the DataFrame could be
loaded with a hash partitioner already set, since each task can simply
read every folder id=<value> for which hash(<value>) % reducer_count ==
reducer_id. Is this an optimization that's on the radar? It would be a
huge boon for reducing the number of shuffles if we're always joining on
the same columns.
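
To make the scenario concrete, here's a minimal sketch (Scala; the path
and sample data are just placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitionBy-groupBy").getOrCreate()
    import spark.implicits._

    // Write a DataFrame out partitioned on disk by "id"; this creates
    // one directory per distinct value, i.e. .../id=<value>/...
    val df = Seq((1, "a"), (2, "b"), (1, "c")).toDF("id", "value")
    df.write.partitionBy("id").parquet("/tmp/df_by_id")

    // Read it back and group on the same column. The question is whether
    // this still needs a shuffle -- check explain() for an Exchange node,
    // even though the data is already laid out by id on disk.
    val loaded = spark.read.parquet("/tmp/df_by_id")
    loaded.groupBy("id").count().explain()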

Best,

Justin
