See https://issues.apache.org/jira/browse/SPARK-16390
On Sat, Jul 2, 2016 at 6:35 PM, Reynold Xin <r...@databricks.com> wrote:

> Thanks, Koert, for the great email. They are all great points.
>
> We should probably create an umbrella JIRA for easier tracking.
>
>
> On Saturday, July 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>
>> After working with the Dataset and Aggregator APIs for a few weeks
>> porting some fairly complex RDD algos (an overall pleasant experience), I
>> wanted to summarize the pain points and some suggestions for improvement
>> based on my experience. All of these are already mentioned on the mailing
>> list or in JIRA, but I figured it's good to put them in one place.
>> See below.
>> Best,
>> Koert
>>
>> *) A lot of practical aggregation functions do not have a zero. This can
>> be dealt with correctly by using null or None as the zero for Aggregator.
>> In algebird, for example, this is expressed by converting an
>> algebird.Aggregator (which does not have a zero) into an
>> algebird.MonoidAggregator (which does have a zero, so similar to Spark's
>> Aggregator) by lifting it. See:
>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L420
>> Something similar should be possible in Spark. However, currently
>> Aggregator does not like its zero to be null or an Option, making this
>> approach difficult. See:
>> https://www.mail-archive.com/user@spark.apache.org/msg53106.html
>> https://issues.apache.org/jira/browse/SPARK-15810
>>
>> *) KeyValueGroupedDataset.reduceGroups needs to be efficient, probably
>> using an Aggregator (with null or None as the zero) under the hood. The
>> current implementation does a flatMapGroups, which is suboptimal.
>>
>> *) KeyValueGroupedDataset needs mapValues. Without this, porting many
>> algos from RDD to Dataset is difficult and clumsy. See:
>> https://issues.apache.org/jira/browse/SPARK-15780
>>
>> *) Aggregators need to also work within DataFrames (so
>> RelationalGroupedDataset) without having to fall back on using Row objects
>> as input. Otherwise all code ends up being written twice, once for
>> Aggregator and once for UserDefinedAggregateFunction/UDAF. This doesn't
>> make sense to me. My attempt at addressing this:
>> https://issues.apache.org/jira/browse/SPARK-15769
>> https://github.com/apache/spark/pull/13512
>>
>> Best,
>> Koert
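To make the first point above concrete, here is a minimal sketch (the class name, encoder choice, and usage are mine, not a proposed API) of an Aggregator lifted to use None as its zero, so that a plain reduce function with no natural zero can still be expressed. The Kryo buffer encoder is only a workaround for the Option-encoder problem tracked in SPARK-15810.

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical name; wraps a reduce function (T, T) => T that has no zero.
class LiftedReduceAggregator[T: Encoder](f: (T, T) => T)
    extends Aggregator[T, Option[T], T] {
  // None plays the role of the missing zero.
  def zero: Option[T] = None
  def reduce(b: Option[T], a: T): Option[T] = b match {
    case Some(x) => Some(f(x, a))
    case None    => Some(a)
  }
  def merge(b1: Option[T], b2: Option[T]): Option[T] = (b1, b2) match {
    case (Some(x), Some(y)) => Some(f(x, y))
    case (x, None)          => x
    case (None, y)          => y
  }
  // Groups produced by groupByKey are never empty, so None never reaches finish.
  def finish(b: Option[T]): T = b.getOrElse(sys.error("empty aggregation"))
  // Workaround for SPARK-15810: serialize the Option buffer with Kryo.
  def bufferEncoder: Encoder[Option[T]] = Encoders.kryo[Option[T]]
  def outputEncoder: Encoder[T] = implicitly[Encoder[T]]
}

// usage sketch, assuming spark.implicits._ is in scope:
//   ds.groupByKey(_.key).agg(new LiftedReduceAggregator[Long](_ + _).toColumn)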
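On the mapValues point, this is a sketch of the kind of workaround currently needed (the record type, names, and local SparkSession are illustrative only): the value projection has to be folded in before groupByKey, and the key then gets dragged along inside every value, which is exactly the clumsiness described above.

import org.apache.spark.sql.SparkSession

object MapValuesWorkaroundSketch {
  case class Event(user: String, amount: Long)  // illustrative record type

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    val ds = Seq(Event("a", 1L), Event("a", 2L), Event("b", 3L)).toDS()

    // Without KeyValueGroupedDataset.mapValues (SPARK-15780), project to
    // (key, value) pairs up front and group on the first element.
    val totals = ds
      .map(e => (e.user, e.amount))
      .groupByKey(_._1)
      .reduceGroups((a, b) => (a._1, a._2 + b._2))      // key is repeated inside each value
      .map { case (user, (_, total)) => (user, total) } // strip the redundant key again

    totals.show()
    spark.stop()
  }
}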
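And on the last point, the duplication in practice: logic that an Aggregator expresses once over typed values has to be rewritten against Row objects and an explicit schema as a UserDefinedAggregateFunction before it can be used with DataFrames / RelationalGroupedDataset. A small sum-of-a-long-column sketch (class and column names are mine):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, LongType, StructField, StructType}

// Hypothetical name; sums a single long column, written Row by Row.
class SumLongUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

// usage sketch: df.groupBy("user").agg(new SumLongUdaf()(df("amount")))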