The shuffle is not necessary if you are using bucketing, which is available in Spark 2.0. For partitioning it is still necessary, because we do not assume each partition is small, and as a result there is no guarantee that all the records for a partition end up in a single Spark task partition.
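For concreteness, here is a minimal Scala sketch of the distinction (the path, table name, and bucket count are illustrative, and whether the planner actually elides the exchange depends on the Spark version and query):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()
    val df = spark.range(1000).toDF("id")  // stand-in data

    // partitionBy: one directory per id value. Nothing ties a given id to a
    // single task partition on read, so a later groupBy("id") still shuffles.
    df.write.partitionBy("id").parquet("/tmp/by_partition")

    // bucketBy (Spark 2.0+): rows are hashed into a fixed number of buckets,
    // so the on-disk layout matches what groupBy/join on "id" needs and the
    // shuffle can be avoided. Bucketed writes go through saveAsTable.
    df.write.bucketBy(8, "id").sortBy("id").saveAsTable("bucketed_ids")

    val grouped = spark.table("bucketed_ids").groupBy("id").count()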
On Thu, Jan 21, 2016 at 3:22 AM, Justin Uang <justin.u...@gmail.com> wrote:
> Hi,
>
> If I had a df and I wrote it out via partitionBy("id"), presumably, when I
> load in the df and do a groupBy("id"), a shuffle shouldn't be necessary
> right? Effectively, we can load in the dataframe with a hash partitioner
> already set, since each task can simply read all the folders where
> id=<value> where hash(<value>) % reducer_count == reducer_id. Is this an
> optimization that is on the radar? This will be a huge boon in terms of
> reducing the number of shuffles necessary if we're always joining on the
> same columns.
>
> Best,
>
> Justin