Aditya Allamraju created HIVE-17355:
---------------------------------------

             Summary: Casting to Decimal along with UNION ALL gives incosistent 
results
                 Key: HIVE-17355
                 URL: https://issues.apache.org/jira/browse/HIVE-17355
             Project: Hive
          Issue Type: Bug
          Components: Parser, UDF
    Affects Versions: 2.1.1, 2.1.0
         Environment: CentOS 7.2
            Reporter: Aditya Allamraju


Extra trailing zeros are added when running "union all" on the tables 
containing decimal data types.

Env:-
=====
mapr-hivemetastore-2.1.201703241741-1.noarch
mapr-hive-2.1.201703241741-1.noarch
mapr-hiveserver2-2.1.201703241741-1.noarch

Steps to repro:-
================

1) CREATE TABLE `decisample`(
  `a` decimal(8,2),
  `b` int,
  `c` decimal(5,2))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/decisample'

2) CREATE TABLE `decisample3`(
  `a` decimal(8,2),
  `b` int,
  `c` decimal(5,2))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/decisample3'

3)hive> select * from decisample3;
OK
1.00    2       3.00
7.00    8       9.00

4)hive> select * from decisample;
OK
4.00    5       6.00

5) query:- 
select a1.a, '' as a1b,'' as a1c from decisample a1 union all select 
a2.a,a2.b,a2.c from decisample3 a2;

o/p:-
OK
4.00            NULL
1.00    2       3.000000000000000000
7.00    8       9.000000000000000000
Time taken: 87.993 seconds, Fetched: 3 row(s)

6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as a1b,'' 
as a1c from decisample a1;

o/p:-
4.00
1.00    2       3
7.00    8       9

Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no 
trailing  zero.

Observation:

1. Hive is trying to run the UNION ALL after ensuring the SELECT's are 
semantically same(equal number of columns and same datatypes). To do this, it 
is implicitly type casting the values where required.

>From the explain plan, type casting is not consistent when done 2 different 
>ways:
a)  select-1  UNION ALL select-2 (Query-5 in above comment)
vs
b) select-2 UNION ALL select-2   (Query-6 in above comment)

Showing only the "expresssions" part of execution plans

Query-5:
========
..
..
Map Operator Tree:
          TableScan
            alias: a1
            Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), '' (type: string), null 
(type: decimal(38,18))
              outputColumnNames: _col0, _col1, _col2
..
..
TableScan
            alias: a2
            Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), UDFToString(b) (type: 
string), CAST( c AS decimal(38,18)) (type: decimal(38,18))


Query-6:
========
..
..
Map Operator Tree:
          TableScan
            alias: a2
            Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), UDFToString(b) (type: 
string), UDFToString(c) (type: string)
..
..
TableScan
            alias: a1
            Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), '' (type: string), '' (type: 
string)
..
..

Attaching the execution plans for both queries for reference.

2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal.
And by default, the precision and scale are taken as (38,18) in Hive. This 
could be the reason for 18 zeros.

3. This is repeating every time implicit type casting is happening on EMPTY 
strings.

If excluding few columns in one of the SELECT statement is absolutely 
necessary, then the only Workaround is to explicitly type cast the empty 
strings to same Datatypes as the Other Select statement which included the 
columns.


For ex:
Q1:

select a,b,c from decisample3
union all
select a,cast(' ' as int),cast(' ' as decimal) from decisample;

Q2:

select a,cast(' ' as int),cast(' ' as decimal) from decisample
union all
select a,b,c from decisample3;

Both the above queries will give consistent result now.

cast(' ' as int) ---> this was cast to INT, same as datatype of "b"
cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to