[
https://issues.apache.org/jira/browse/SPARK-15769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-15769:
---------------------------------
Labels: bulk-closed (was: )
> Add Encoder for input type to Aggregator
> ----------------------------------------
>
> Key: SPARK-15769
> URL: https://issues.apache.org/jira/browse/SPARK-15769
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: koert kuipers
> Priority: Minor
> Labels: bulk-closed
>
> Currently org.apache.spark.sql.expressions.Aggregator has Encoders for its
> buffer and output type, but not for its input type. The thought is that the
> input type is known from the Dataset it operates on and hence can be inserted
> later.
> However, I think there are compelling reasons to have Aggregator carry an
> Encoder for its input type:
> * Generally, transformations on a Dataset only require the Encoder for the
> result type, since the input type is exactly known and its Encoder is already
> available within the Dataset. However, this is not the case for an Aggregator:
> an Aggregator is defined independently of a Dataset, and I think it should be
> generally desirable that an Aggregator work on any type that can safely be
> cast to the Aggregator's input type (for example, an Aggregator that has Long
> as input should work on a Dataset of Ints).
> * Aggregators should also work on DataFrames, because it's a much nicer API to
> use than UserDefinedAggregateFunction. And when operating on DataFrames you
> should not have to use Row objects, which means your input type is not equal
> to the type of the Dataset you operate on (so the Encoder of the Dataset that
> is operated on should not be used as the input Encoder for the Aggregator).
> * Adding an input Encoder is not a big burden, since it can typically be
> created implicitly.
> * It removes TypedColumn.withInputType and its usage in Dataset,
> KeyValueGroupedDataset and RelationalGroupedDataset, which always felt
> somewhat ad hoc to me.
> * Once an Aggregator has an Encoder for its input type, it is a small change
> to make the Aggregator also work on a subset of columns in a DataFrame, which
> facilitates Aggregator re-use since you don't have to write a custom
> Aggregator to extract the columns from a specific DataFrame. This also
> enables a usage that is more typical within a DataFrame context, very similar
> to how a UserDefinedAggregateFunction is used.
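
For readers unfamiliar with the API under discussion, here is a minimal sketch of an Aggregator as it is defined today, with Encoders for the buffer and output types only. The names SumLongs and AggregatorSketch are hypothetical and chosen for illustration; the local SparkSession setup is an assumption about how one might try it out, not part of the issue.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical example aggregator: sums Long inputs.
object SumLongs extends Aggregator[Long, Long, Long] {
  // Initial value of the aggregation buffer.
  def zero: Long = 0L
  // Fold one input value into the buffer.
  def reduce(buffer: Long, value: Long): Long = buffer + value
  // Combine two partial buffers.
  def merge(left: Long, right: Long): Long = left + right
  // Turn the final buffer into the output value.
  def finish(buffer: Long): Long = buffer
  // Only the buffer and output types carry Encoders here;
  // there is no corresponding member for the input type.
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

object AggregatorSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("aggregator-sketch")
      .getOrCreate()
    import spark.implicits._

    // The input type (Long) is taken from the Dataset the
    // Aggregator is applied to, not from the Aggregator itself.
    val longs = Seq(1L, 2L, 3L).toDS()
    val total = longs.select(SumLongs.toColumn).head()
    println(total)

    spark.stop()
  }
}
```

In this sketch the typed column produced by toColumn is applied to a Dataset[Long], so the input types line up exactly. The description above argues that the same Aggregator should also be usable on a Dataset of Ints or on selected DataFrame columns, which is where an input Encoder on the Aggregator would come in.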