That's good to know
Thanks
Etienne
On Thursday, February 28, 2019 at 10:05 -0800, Reynold Xin wrote:
> This should be fine. Dataset.groupByKey is a logical operation, not a
> physical one (i.e., Spark will not necessarily materialize all the
> groups in memory).
> On Thu, Feb 28, 2019 at 1:46 AM Etienne Chauchot <echauc...@apache.org> wrote:
> > Hi all,
> >
> > I'm migrating RDD pipelines to Datasets and I noticed that Combine.PerKey is
> > no longer available in the Dataset API, so I translated it to:
> >
> >
> > KeyValueGroupedDataset<K, KV<K, InputT>> groupedDataset =
> >     keyedDataset.groupByKey(KVHelpers.extractKey(), EncoderHelpers.genericEncoder());
> >
> > Dataset<Tuple2<K, OutputT>> combinedDataset =
> >     groupedDataset.agg(
> >         new Aggregator<KV<K, InputT>, AccumT, OutputT>(combineFn).toColumn());
> >
> > I have a question regarding performance: since groupByKey is generally
> > less performant (it entails a shuffle, and a possible OOM if a given key
> > has a lot of data associated with it), I was wondering whether the Spark
> > optimizer translates such a DAG into a combine-per-key behind the scenes.
> > In other words, will such a DAG be translated into a local (or "partial",
> > I don't know which terminology you use) combine followed by a global
> > combine, so that only per-key partial results are shuffled?
> >
> > Thanks
> > Etienne
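
The "local combine, then global combine" scheme the question refers to can be sketched outside Spark in plain Java. This is an illustrative model only (class and method names such as `PartialCombine` are hypothetical, not Spark or Beam APIs): each partition first reduces its own records to one partial value per key, so only those partials cross the shuffle boundary, and a final per-key merge produces the global result.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of a two-phase combine-per-key (integer sums per
// String key). Not a Spark or Beam API; it only models the optimization
// discussed above.
public class PartialCombine {

    // Phase 1: local ("partial") combine within one partition.
    // Reduces all of a partition's records to one partial sum per key,
    // so only these partials would need to be shuffled.
    static Map<String, Integer> localCombine(List<Map.Entry<String, Integer>> partition) {
        Map<String, Integer> partials = new HashMap<>();
        for (Map.Entry<String, Integer> record : partition) {
            partials.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return partials;
    }

    // Phase 2: global combine after the shuffle.
    // Merges the per-partition partials into the final per-key result.
    static Map<String, Integer> globalCombine(List<Map<String, Integer>> partials) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, value) -> result.merge(key, value, Integer::sum));
        }
        return result;
    }

    // End-to-end: partial combine per partition, then a global merge.
    public static Map<String, Integer> combinePerKey(
            List<List<Map.Entry<String, Integer>>> partitions) {
        List<Map<String, Integer>> partials = partitions.stream()
                .map(PartialCombine::localCombine)
                .collect(Collectors.toList());
        return globalCombine(partials);
    }
}
```

Whether Spark actually applies this two-phase plan to a given `groupByKey(...).agg(...)` can be checked on the real pipeline by inspecting the physical plan with `combinedDataset.explain()`.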