Hi,

If I have a DataFrame and write it out via partitionBy("id"), then presumably, when I load it back in and do a groupBy("id"), a shuffle shouldn't be necessary, right? Effectively, we could load the DataFrame with a hash partitioner already in place, since each task could simply read every folder id=<value> for which hash(<value>) % reducer_count == reducer_id. Is this an optimization that is on the radar? It would be a huge boon in terms of reducing the number of shuffles needed when we're always joining or aggregating on the same columns.
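For concreteness, here's a minimal sketch of the pattern I have in mind (the Scala API, the local master, and the path /tmp/part_by_id are just placeholders for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partitionBy-then-groupBy")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Write a small DataFrame out, partitioned on disk by "id".
    val df = Seq((1, "a"), (1, "b"), (2, "c")).toDF("id", "value")
    df.write.mode("overwrite").partitionBy("id").parquet("/tmp/part_by_id")

    // Read it back and aggregate on the same column.
    val loaded = spark.read.parquet("/tmp/part_by_id")
    val counts = loaded.groupBy("id").count()

    // As far as I can tell, the physical plan still shows an Exchange
    // (shuffle) before the final aggregation, even though the data is
    // already laid out by id on disk.
    counts.explain()

If the reader propagated the on-disk partitioning as a hash partitioner on the loaded DataFrame, that Exchange could presumably be elided entirely.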
Best,
Justin