Re: Review Request 42508: HIVE-12889: Support COUNT(DISTINCT) for partitioning query.

Aihua Xu Fri, 22 Jan 2016 07:00:43 -0800


> On Jan. 22, 2016, 8:04 a.m., Szehon Ho wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCount.java, 
> > line 161
> > <https://reviews.apache.org/r/42508/diff/2/?file=1203554#file1203554line161>
> >
> >     Why do we need to do ArrayUtils.isEquals as well as hash comparison?
> >     
> >     And another suggestion, can we do something like 
> > ObjectInspectorUtils.compare like other UDAF's?

Actually I have difficulties to make a copy of the previous rows and do a 
comparison. Also for your second question, I tried all the data types and 
noticed that I need to make a copy for Text/LazyString, but really not 100% 
sure if anything else needs a copy. Because of that, I added a hash comparison 
in addition to ArrayUtils.isEquals() for the comparison. In most of the cases 
(for all the hive data types), the hash should be able to represent the data, 
but as the standard, different data could have the same hash. That's why when 
the hash is the same, I need to compare the data itself to tell if they are the 
same or not. 

With the current hash code implementation for the hive data types, I think hash 
comparison probably is enough since the hash is generated from characters of 
Text/String/map/..., or numbers of int, double,etc. Of course, there is no such 
guarantee for that. How do you think? 

I checked ObjectInspectorUtils.compare(). That is much better for comparison. 

Do you think just hash comparison is enough? Any better idea to make a copy?

- Aihua

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42508/#review115804
-----------------------------------------------------------

On Jan. 20, 2016, 5:05 p.m., Aihua Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42508/
> -----------------------------------------------------------
> 
> (Updated Jan. 20, 2016, 5:05 p.m.)
> 
> 
> Review request for hive, Chaoyu Tang, Szehon Ho, and Xuefu Zhang.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-12889: Support COUNT(DISTINCT) for partitioning query.
> 
> 
> Diffs
> -----
> 
>   data/files/windowing_distinct.txt PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/functions/HiveSqlCountAggFunction.java
>  7937040 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/functions/HiveSqlSumAggFunction.java
>  8f62970 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/PlanModifierForASTConv.java
>  e2fbb4f 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/SqlFunctionConverter.java
>  37249f9 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java 3fefbd7 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g 15ca754 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/PTFInvocationSpec.java 29b8510 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 15773e5 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/WindowingSpec.java a181f7c 
>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCount.java 
> eaf112e 
>   ql/src/test/queries/clientpositive/windowing_distinct.q PRE-CREATION 
>   ql/src/test/results/clientpositive/windowing_distinct.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/42508/diff/
> 
> 
> Testing
> -------
> 
> Support count(distinct) over partitioning window. 
> 
> 1. Enabling the parser to properly parse such query "count(distinct) over 
> (partition by c1)";
> 2. ORDER BY and windowing frame won't work with the functions of distinct due 
> to performance concern and implementation requirement.
> 3. We insert the distinct fields into the order by list, so during counting, 
> we only need to compare the current row against the previous remembered row.
> 
> 
> Thanks,
> 
> Aihua Xu
> 
>

Re: Review Request 42508: HIVE-12889: Support COUNT(DISTINCT) for partitioning query.

Reply via email to