Hi Takeshi,
Thanks for the answer. My UDAF aggregates data into an array of rows.
Apparently this makes it ineligible for hash-based aggregation, based on
the logic at:
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationM
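For reference, a UDAF of roughly this shape (field and class names are
illustrative, not my exact code) carries a variable-length ArrayType field
in its aggregation buffer, which is what trips that check:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CollectRows extends UserDefinedAggregateFunction {
  private val rowType = StructType(Seq(
    StructField("ts", LongType),
    StructField("value", DoubleType)))

  override def inputSchema: StructType = rowType
  // ArrayType is not a mutable fixed-width type, so this buffer schema
  // fails the eligibility test for hash-based aggregation.
  override def bufferSchema: StructType =
    StructType(Seq(StructField("rows", ArrayType(rowType))))
  override def dataType: DataType = ArrayType(rowType)
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[Row]
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getSeq[Row](0) :+ input
  override def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
    b1(0) = b1.getSeq[Row](0) ++ b2.getSeq[Row](0)
  override def evaluate(buffer: Row): Any = buffer.getSeq[Row](0)
}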
Hi,
Spark uses hash-based aggregation whenever every type in the aggregation
buffer is supported there; when that check fails, it falls back to
sort-based aggregation.
See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.s
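If it helps, you can probe a buffer schema against the same eligibility
test the planner consults. This is just a sketch calling an internal
class, so the package and accessibility may differ across Spark versions:

import org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap
import org.apache.spark.sql.types._

// A buffer made only of mutable fixed-width types qualifies for hash
// aggregation; a variable-length ArrayType field disqualifies it.
val longBuffer  = StructType(Seq(StructField("count", LongType)))
val arrayBuffer = StructType(Seq(StructField("rows", ArrayType(LongType))))

UnsafeFixedWidthAggregationMap.supportsAggregationBufferSchema(longBuffer)   // true
UnsafeFixedWidthAggregationMap.supportsAggregationBufferSchema(arrayBuffer)  // false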
Hi all,
It appears to me that Dataset.groupBy().agg(udaf) requires a full sort,
which is very inefficient for certain aggregations.
The code is very simple:
- I have a UDAF
- What I want to do is: dataset.groupBy(cols).agg(udaf).count() (see the sketch below)
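Roughly like this (column and function names are placeholders, not my
real schema):

import org.apache.spark.sql.functions.col

val udaf = new MyUdaf()  // a UserDefinedAggregateFunction
val result = dataset
  .groupBy(col("key1"), col("key2"))
  .agg(udaf(col("value")).as("agg"))
result.explain()  // shows whether HashAggregate or SortAggregate is planned
println(result.count())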
The physical plan I got was:
*HashAggregate(keys=[], functions