Hi,

If I have a DataFrame and write it out via partitionBy("id"), then
presumably, when I load the DataFrame back in and do a groupBy("id"), a
shuffle shouldn't be necessary, right? Effectively, the DataFrame could be
loaded with a hash partitioner already set, since each task can simply
read every folder id=<value> for which hash(<value>) % reducer_count ==
reducer_id. Is this an optimization that's on the radar? It would be a
huge boon for reducing the number of shuffles if we're always joining on
the same columns.
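
To make the scenario concrete, here's a minimal sketch (Scala; the path
and sample data are just placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitionBy-groupBy").getOrCreate()
    import spark.implicits._

    // Write a DataFrame out partitioned on disk by "id"; this creates
    // one directory per distinct value, i.e. .../id=<value>/...
    val df = Seq((1, "a"), (2, "b"), (1, "c")).toDF("id", "value")
    df.write.partitionBy("id").parquet("/tmp/df_by_id")

    // Read it back and group on the same column. The question is whether
    // this still needs a shuffle -- check explain() for an Exchange node,
    // even though the data is already laid out by id on disk.
    val loaded = spark.read.parquet("/tmp/df_by_id")
    loaded.groupBy("id").count().explain()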

Best,

Justin
