See https://issues.apache.org/jira/browse/SPARK-16390
On Sat, Jul 2, 2016 at 6:35 PM, Reynold Xin <r...@databricks.com> wrote:

> Thanks, Koert, for the great email. They are all great points.
>
> We should probably create an umbrella JIRA for easier tracking.
>
>
> On Saturday, July 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>
>> After working with the Dataset and Aggregator APIs for a few weeks
>> porting some fairly complex RDD algos (an overall pleasant experience), I
>> wanted to summarize the pain points and some suggestions for improvement
>> based on my experience. All of these are already mentioned on the mailing
>> list or in JIRA, but I figured it's good to put them in one place.
>> See below.
>> Best,
>> Koert
>>
>> *) A lot of practical aggregation functions do not have a zero. This can
>> be dealt with correctly by using null or None as the zero for Aggregator.
>> In algebird, for example, this is expressed by converting an
>> algebird.Aggregator (which does not have a zero) into an
>> algebird.MonoidAggregator (which does have a zero, so similar to Spark's
>> Aggregator) by lifting it. See:
>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L420
>> Something similar should be possible in Spark. However, currently
>> Aggregator does not like its zero to be null or an Option, making this
>> approach difficult. See:
>> https://www.mail-archive.com/user@spark.apache.org/msg53106.html
>> https://issues.apache.org/jira/browse/SPARK-15810
>>
>> *) KeyValueGroupedDataset.reduceGroups needs to be efficient, probably
>> using an Aggregator (with null or None as the zero) under the hood. The
>> current implementation does a flatMapGroups, which is suboptimal.
>>
>> *) KeyValueGroupedDataset needs mapValues. Without this, porting many
>> algos from RDD to Dataset is difficult and clumsy. See:
>> https://issues.apache.org/jira/browse/SPARK-15780
>>
>> *) Aggregators need to also work within DataFrames (so
>> RelationalGroupedDataset) without having to fall back on using Row objects
>> as input. Otherwise all code ends up being written twice, once for
>> Aggregator and once for UserDefinedAggregateFunction/UDAF. This doesn't
>> make sense to me. My attempt at addressing this:
>> https://issues.apache.org/jira/browse/SPARK-15769
>> https://github.com/apache/spark/pull/13512
>>
>> Best,
>> Koert
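To make the first point above concrete, here is a minimal sketch (the class name, encoder choice, and usage are mine, not a proposed API) of an Aggregator lifted to use None as its zero, so that a plain reduce function with no natural zero can still be expressed. The Kryo buffer encoder is only a workaround for the Option-encoder problem tracked in SPARK-15810.

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical name; wraps a reduce function (T, T) => T that has no zero.
class LiftedReduceAggregator[T: Encoder](f: (T, T) => T)
    extends Aggregator[T, Option[T], T] {
  // None plays the role of the missing zero.
  def zero: Option[T] = None
  def reduce(b: Option[T], a: T): Option[T] = b match {
    case Some(x) => Some(f(x, a))
    case None    => Some(a)
  }
  def merge(b1: Option[T], b2: Option[T]): Option[T] = (b1, b2) match {
    case (Some(x), Some(y)) => Some(f(x, y))
    case (x, None)          => x
    case (None, y)          => y
  }
  // Groups produced by groupByKey are never empty, so None never reaches finish.
  def finish(b: Option[T]): T = b.getOrElse(sys.error("empty aggregation"))
  // Workaround for SPARK-15810: serialize the Option buffer with Kryo.
  def bufferEncoder: Encoder[Option[T]] = Encoders.kryo[Option[T]]
  def outputEncoder: Encoder[T] = implicitly[Encoder[T]]
}

// usage sketch, assuming spark.implicits._ is in scope:
//   ds.groupByKey(_.key).agg(new LiftedReduceAggregator[Long](_ + _).toColumn)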
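On the mapValues point, this is a sketch of the kind of workaround currently needed (the record type, names, and local SparkSession are illustrative only): the value projection has to be folded in before groupByKey, and the key then gets dragged along inside every value, which is exactly the clumsiness described above.

import org.apache.spark.sql.SparkSession

object MapValuesWorkaroundSketch {
  case class Event(user: String, amount: Long)  // illustrative record type

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    val ds = Seq(Event("a", 1L), Event("a", 2L), Event("b", 3L)).toDS()

    // Without KeyValueGroupedDataset.mapValues (SPARK-15780), project to
    // (key, value) pairs up front and group on the first element.
    val totals = ds
      .map(e => (e.user, e.amount))
      .groupByKey(_._1)
      .reduceGroups((a, b) => (a._1, a._2 + b._2))      // key is repeated inside each value
      .map { case (user, (_, total)) => (user, total) } // strip the redundant key again

    totals.show()
    spark.stop()
  }
}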
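And on the last point, the duplication in practice: logic that an Aggregator expresses once over typed values has to be rewritten against Row objects and an explicit schema as a UserDefinedAggregateFunction before it can be used with DataFrames / RelationalGroupedDataset. A small sum-of-a-long-column sketch (class and column names are mine):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, LongType, StructField, StructType}

// Hypothetical name; sums a single long column, written Row by Row.
class SumLongUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

// usage sketch: df.groupBy("user").agg(new SumLongUdaf()(df("amount")))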