That's good to know
Thanks
Etienne
On Thursday, February 28, 2019 at 10:05 -0800, Reynold Xin wrote:
> This should be fine. Dataset.groupByKey is a logical operation, not a
> physical one (i.e., Spark will not necessarily materialize all the
> groups in memory).
> On Thu, Feb 28, 2019 at 1:46 AM Etienne Chauchot <echauc...@apache.org> wrote:
> > Hi all,
> >
> > I'm migrating RDD pipelines to Datasets and I noticed that Combine.PerKey is
> > no longer available in the Dataset API, so I translated it to:
> >
> >
> > KeyValueGroupedDataset<K, KV<K, InputT>> groupedDataset =
> >     keyedDataset.groupByKey(KVHelpers.extractKey(), EncoderHelpers.genericEncoder());
> >
> > Dataset<Tuple2<K, OutputT>> combinedDataset =
> >     groupedDataset.agg(
> >         new Aggregator<KV<K, InputT>, AccumT, OutputT>(combineFn).toColumn());
> >
> > I have a question regarding performance: since groupByKey is generally
> > less performant (it entails a shuffle, and a possible OOM if a given key
> > has a lot of data associated with it), I was wondering whether the Spark
> > optimizer translates such a DAG into a combine-per-key behind the scenes.
> > In other words, will such a DAG be translated into a local (or "partial",
> > I don't know which terminology you use) combine followed by a global
> > combine, so that only per-key partial results are shuffled?
> >
> > Thanks
> > Etienne
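
The "local combine, then global combine" scheme the question refers to can be sketched outside Spark in plain Java. This is an illustrative model only (class and method names such as `PartialCombine` are hypothetical, not Spark or Beam APIs): each partition first reduces its own records to one partial value per key, so only those partials cross the shuffle boundary, and a final per-key merge produces the global result.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of a two-phase combine-per-key (integer sums per
// String key). Not a Spark or Beam API; it only models the optimization
// discussed above.
public class PartialCombine {

    // Phase 1: local ("partial") combine within one partition.
    // Reduces all of a partition's records to one partial sum per key,
    // so only these partials would need to be shuffled.
    static Map<String, Integer> localCombine(List<Map.Entry<String, Integer>> partition) {
        Map<String, Integer> partials = new HashMap<>();
        for (Map.Entry<String, Integer> record : partition) {
            partials.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return partials;
    }

    // Phase 2: global combine after the shuffle.
    // Merges the per-partition partials into the final per-key result.
    static Map<String, Integer> globalCombine(List<Map<String, Integer>> partials) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, value) -> result.merge(key, value, Integer::sum));
        }
        return result;
    }

    // End-to-end: partial combine per partition, then a global merge.
    public static Map<String, Integer> combinePerKey(
            List<List<Map.Entry<String, Integer>>> partitions) {
        List<Map<String, Integer>> partials = partitions.stream()
                .map(PartialCombine::localCombine)
                .collect(Collectors.toList());
        return globalCombine(partials);
    }
}
```

Whether Spark actually applies this two-phase plan to a given `groupByKey(...).agg(...)` can be checked on the real pipeline by inspecting the physical plan with `combinedDataset.explain()`.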