The shuffle is not necessary if you are using bucketing, which is available in Spark 2.0. For partitioning it is still necessary, because we do not assume each partition is small, and as a result there is no guarantee that all the records for a partition end up in a single Spark task partition.
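For concreteness, here is a minimal Scala sketch of the distinction (the path, table name, and bucket count are illustrative, and whether the planner actually elides the exchange depends on the Spark version and query):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()
    val df = spark.range(1000).toDF("id")  // stand-in data

    // partitionBy: one directory per id value. Nothing ties a given id to a
    // single task partition on read, so a later groupBy("id") still shuffles.
    df.write.partitionBy("id").parquet("/tmp/by_partition")

    // bucketBy (Spark 2.0+): rows are hashed into a fixed number of buckets,
    // so the on-disk layout matches what groupBy/join on "id" needs and the
    // shuffle can be avoided. Bucketed writes go through saveAsTable.
    df.write.bucketBy(8, "id").sortBy("id").saveAsTable("bucketed_ids")

    val grouped = spark.table("bucketed_ids").groupBy("id").count()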
On Thu, Jan 21, 2016 at 3:22 AM, Justin Uang <justin.u...@gmail.com> wrote:
> Hi,
>
> If I had a df and I wrote it out via partitionBy("id"), presumably, when I
> load in the df and do a groupBy("id"), a shuffle shouldn't be necessary
> right? Effectively, we can load in the dataframe with a hash partitioner
> already set, since each task can simply read all the folders where
> id=<value> where hash(<value>) % reducer_count == reducer_id. Is this an
> optimization that is on the radar? This will be a huge boon in terms of
> reducing the number of shuffles necessary if we're always joining on the
> same columns.
>
> Best,
>
> Justin