This should be fine. Dataset.groupByKey is a logical operation, not a
physical one: Spark doesn't necessarily materialize all of the groups in
memory.
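To make the "local combine, then global combine" idea concrete, here is a minimal plain-Java sketch (no Spark dependency; the partition layout and the sum combiner are illustrative, not Spark internals). Each partition reduces its records to one accumulator per key before anything crosses the network, and only those accumulators are merged globally:

```java
import java.util.*;

// Illustrative sketch: a "partial" (map-side) combine runs per partition,
// so only one accumulator per key per partition would be shuffled;
// the "final" combine then merges the partial accumulators.
public class PartialCombineSketch {

    // Partial combine: fold one partition's records into per-key accumulators.
    static Map<String, Integer> partialCombine(List<Map.Entry<String, Integer>> partition) {
        Map<String, Integer> acc = new HashMap<>();
        for (Map.Entry<String, Integer> e : partition) {
            acc.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return acc;
    }

    // Final combine: merge the per-partition accumulators into one result.
    static Map<String, Integer> finalCombine(List<Map<String, Integer>> partials) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map<String, Integer> p : partials) {
            p.forEach((k, v) -> out.merge(k, v, Integer::sum));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two hypothetical partitions of (key, value) records.
        List<Map.Entry<String, Integer>> p1 =
            List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        List<Map.Entry<String, Integer>> p2 =
            List.of(Map.entry("b", 4), Map.entry("a", 5));

        Map<String, Integer> result =
            finalCombine(List.of(partialCombine(p1), partialCombine(p2)));
        System.out.println(result); // prints {a=9, b=6}
    }
}
```

Whether Spark actually plans a given typed aggregation this way is visible in the physical plan (e.g. via Dataset.explain()), which is the authoritative way to answer the question below for a specific pipeline.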

On Thu, Feb 28, 2019 at 1:46 AM Etienne Chauchot <echauc...@apache.org>
wrote:

> Hi all,
>
> I'm migrating RDD pipelines to Dataset and I saw that Combine.PerKey is no
> longer there in the Dataset API. So I translated it to:
>
>
> KeyValueGroupedDataset<K, KV<K, InputT>> groupedDataset =
>     keyedDataset.groupByKey(KVHelpers.extractKey(), EncoderHelpers.genericEncoder());
>
> Dataset<Tuple2<K, OutputT>> combinedDataset =
>     groupedDataset.agg(
>         new Aggregator<KV<K, InputT>, AccumT, OutputT>(combineFn).toColumn());
>
>
> I have a question regarding performance: as GroupByKey is generally less
> performant (it entails a shuffle, and a possible OOM if a given key has a
> lot of data associated with it), I was wondering whether the new Spark
> optimizer translates such a DAG into a combine-per-key behind the scenes.
> In other words, is such a DAG translated into a local (or partial, I don't
> know which terminology you use) combine followed by a global combine, to
> avoid shuffling all the raw data?
>
> Thanks
>
> Etienne
>
