[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841516#comment-13841516
 ] 

Xuefu Zhang commented on HIVE-5878:
-----------------------------------

[~ehans] Thank you for your concern. However, I respectfully disagree that the 
behavior WAS and IS reasonable for several reasons. First, AVG was probably 
introduced before decimal, so there was no better choice than double. Hive has 
the concept of exact types (int, long, decimal, etc.) vs approximate types 
(double, float, etc), and Arithmetic operations (plus, divide, etc) on exact 
types generates exact type for accuracy. If average is defined mathematically 
as sum/count, then sum(int)/count should result an exact type. Otherwise, avg() 
and sum()/count give different result. Another inconsistency exists when 
avg(decimal) results a decimal. All those cause inconsistency in Hive's 
mathematical concept and function behavior, and can create confusions among 
users as well.

I understand vectorized current implementation chooses double for sum and uses 
sum/count to get another double for average. While this extends the scope of 
the changes, to me, however, vectorization is just implementation, which should 
not dictate high-level concept and consistency.

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-5878
>                 URL: https://issues.apache.org/jira/browse/HIVE-5878
>             Project: Hive
>          Issue Type: Bug
>          Components: Types, UDF
>    Affects Versions: 0.12.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d                     int                     None                
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: avg(VALUE._col0)
>           bucketGroup: false
>           mode: mergepartial
>           outputColumnNames: _col0
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +-------+--------------+------+-----+---------+-------+
> | Field | Type         | Null | Key | Default | Extra |
> +-------+--------------+------+-----+---------+-------+
> | i     | int(11)      | YES  |     | NULL    |       |
> | b     | tinyint(1)   | YES  |     | NULL    |       |
> | d     | double       | YES  |     | NULL    |       |
> | s     | varchar(5)   | YES  |     | NULL    |       |
> | dd    | decimal(5,2) | YES  |     | NULL    |       |
> +-------+--------------+------+-----+---------+-------+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +-------+---------------+------+-----+---------+-------+
> | Field | Type          | Null | Key | Default | Extra |
> +-------+---------------+------+-----+---------+-------+
> | avg(i) | decimal(14,4) | YES  |     | NULL    |       |
> +-------+---------------+------+-----+---------+-------+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to