[jira] [Updated] (HIVE-17355) Casting to Decimal along with UNION ALL gives incosistent results

Aditya Allamraju (JIRA) Fri, 18 Aug 2017 03:19:30 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Aditya Allamraju updated HIVE-17355:
------------------------------------
    Description: 
Extra trailing zeros are added when running "union all" on the tables 
containing decimal data types.

*Version:* Hive 2.1

*Steps to repro:-*

{code:java}

1) CREATE TABLE `decisample`(
  `a` decimal(8,2),
  `b` int,
  `c` decimal(5,2))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/decisample'

2) CREATE TABLE `decisample3`(
  `a` decimal(8,2),
  `b` int,
  `c` decimal(5,2))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/decisample3'

3)hive> select * from decisample3;
OK
1.00    2       3.00
7.00    8       9.00

4)hive> select * from decisample;
OK
4.00    5       6.00

5) query:- 
select a1.a, '' as a1b,'' as a1c from decisample a1 union all select 
a2.a,a2.b,a2.c from decisample3 a2;

o/p:-
OK
4.00            NULL
1.00    2       3.000000000000000000
7.00    8       9.000000000000000000
Time taken: 87.993 seconds, Fetched: 3 row(s)

6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as a1b,'' 
as a1c from decisample a1;

o/p:-
4.00
1.00    2       3
7.00    8       9

{code}

Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no 
trailing  zero.

Observation:

1. Hive is trying to run the UNION ALL after ensuring the SELECT's are 
semantically same(equal number of columns and same datatypes). To do this, it 
is implicitly type casting the values where required.

>From the explain plan, type casting is not consistent when done 2 different 
>ways:
a)  select-1  UNION ALL select-2 (Query-5 in above comment)
vs
b) select-2 UNION ALL select-2   (Query-6 in above comment)

Showing only the "expresssions" part of execution plans

Query-5:
========

{code:java}
..
..
Map Operator Tree:
          TableScan
            alias: a1
            Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), '' (type: string), null 
(type: decimal(38,18))
              outputColumnNames: _col0, _col1, _col2
..
..
TableScan
            alias: a2
            Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), UDFToString(b) (type: 
string), CAST( c AS decimal(38,18)) (type: decimal(38,18))


{code}

Query-6:
========

{code:java}
..
..
Map Operator Tree:
          TableScan
            alias: a2
            Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), UDFToString(b) (type: 
string), UDFToString(c) (type: string)
..
..
TableScan
            alias: a1
            Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), '' (type: string), '' (type: 
string)
..
..


{code}
Attaching the execution plans for both queries for reference.

2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal.
And by default, the precision and scale are taken as (38,18) in Hive. This 
could be the reason for 18 zeros.

3. This is repeating every time implicit type casting is happening on EMPTY 
strings.

If excluding few columns in one of the SELECT statement is absolutely 
necessary, then the only Workaround is to explicitly type cast the empty 
strings to same Datatypes as the Other Select statement which included the 
columns.


For ex:
Q1:

select a,b,c from decisample3
union all
select a,cast(' ' as int),cast(' ' as decimal) from decisample;

Q2:

select a,cast(' ' as int),cast(' ' as decimal) from decisample
union all
select a,b,c from decisample3;

Both the above queries will give consistent result now.

cast(' ' as int) ---> this was cast to INT, same as datatype of "b"
cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c"

  was:
Extra trailing zeros are added when running "union all" on the tables 
containing decimal data types.

Version: Hive 2.1

Steps to repro:-
================

1) CREATE TABLE `decisample`(
  `a` decimal(8,2),
  `b` int,
  `c` decimal(5,2))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/decisample'

2) CREATE TABLE `decisample3`(
  `a` decimal(8,2),
  `b` int,
  `c` decimal(5,2))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/decisample3'

3)hive> select * from decisample3;
OK
1.00    2       3.00
7.00    8       9.00

4)hive> select * from decisample;
OK
4.00    5       6.00

5) query:- 
select a1.a, '' as a1b,'' as a1c from decisample a1 union all select 
a2.a,a2.b,a2.c from decisample3 a2;

o/p:-
OK
4.00            NULL
1.00    2       3.000000000000000000
7.00    8       9.000000000000000000
Time taken: 87.993 seconds, Fetched: 3 row(s)

6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as a1b,'' 
as a1c from decisample a1;

o/p:-
4.00
1.00    2       3
7.00    8       9

Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no 
trailing  zero.

Observation:

1. Hive is trying to run the UNION ALL after ensuring the SELECT's are 
semantically same(equal number of columns and same datatypes). To do this, it 
is implicitly type casting the values where required.

>From the explain plan, type casting is not consistent when done 2 different 
>ways:
a)  select-1  UNION ALL select-2 (Query-5 in above comment)
vs
b) select-2 UNION ALL select-2   (Query-6 in above comment)

Showing only the "expresssions" part of execution plans

Query-5:
========
..
..
Map Operator Tree:
          TableScan
            alias: a1
            Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), '' (type: string), null 
(type: decimal(38,18))
              outputColumnNames: _col0, _col1, _col2
..
..
TableScan
            alias: a2
            Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), UDFToString(b) (type: 
string), CAST( c AS decimal(38,18)) (type: decimal(38,18))


Query-6:
========
..
..
Map Operator Tree:
          TableScan
            alias: a2
            Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), UDFToString(b) (type: 
string), UDFToString(c) (type: string)
..
..
TableScan
            alias: a1
            Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column 
stats: NONE
            Select Operator
              expressions: a (type: decimal(8,2)), '' (type: string), '' (type: 
string)
..
..

Attaching the execution plans for both queries for reference.

2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal.
And by default, the precision and scale are taken as (38,18) in Hive. This 
could be the reason for 18 zeros.

3. This is repeating every time implicit type casting is happening on EMPTY 
strings.

If excluding few columns in one of the SELECT statement is absolutely 
necessary, then the only Workaround is to explicitly type cast the empty 
strings to same Datatypes as the Other Select statement which included the 
columns.


For ex:
Q1:

select a,b,c from decisample3
union all
select a,cast(' ' as int),cast(' ' as decimal) from decisample;

Q2:

select a,cast(' ' as int),cast(' ' as decimal) from decisample
union all
select a,b,c from decisample3;

Both the above queries will give consistent result now.

cast(' ' as int) ---> this was cast to INT, same as datatype of "b"
cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c"


> Casting to Decimal along with UNION ALL gives incosistent results
> -----------------------------------------------------------------
>
>                 Key: HIVE-17355
>                 URL: https://issues.apache.org/jira/browse/HIVE-17355
>             Project: Hive
>          Issue Type: Bug
>          Components: Parser, UDF
>    Affects Versions: 2.1.0, 2.1.1
>         Environment: CentOS 7.2
>            Reporter: Aditya Allamraju
>            Assignee: Aditya Allamraju
>
> Extra trailing zeros are added when running "union all" on the tables 
> containing decimal data types.
> *Version:* Hive 2.1
> *Steps to repro:-*
> {code:java}
> 1) CREATE TABLE `decisample`(
>   `a` decimal(8,2),
>   `b` int,
>   `c` decimal(5,2))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'maprfs:/user/hive/warehouse/decisample'
> 2) CREATE TABLE `decisample3`(
>   `a` decimal(8,2),
>   `b` int,
>   `c` decimal(5,2))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'maprfs:/user/hive/warehouse/decisample3'
> 3)hive> select * from decisample3;
> OK
> 1.00    2       3.00
> 7.00    8       9.00
> 4)hive> select * from decisample;
> OK
> 4.00    5       6.00
> 5) query:- 
> select a1.a, '' as a1b,'' as a1c from decisample a1 union all select 
> a2.a,a2.b,a2.c from decisample3 a2;
> o/p:-
> OK
> 4.00            NULL
> 1.00    2       3.000000000000000000
> 7.00    8       9.000000000000000000
> Time taken: 87.993 seconds, Fetched: 3 row(s)
> 6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as 
> a1b,'' as a1c from decisample a1;
> o/p:-
> 4.00
> 1.00    2       3
> 7.00    8       9
> {code}
> Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no 
> trailing  zero.
> Observation:
> 1. Hive is trying to run the UNION ALL after ensuring the SELECT's are 
> semantically same(equal number of columns and same datatypes). To do this, it 
> is implicitly type casting the values where required.
> From the explain plan, type casting is not consistent when done 2 different 
> ways:
> a)  select-1  UNION ALL select-2 (Query-5 in above comment)
> vs
> b) select-2 UNION ALL select-2   (Query-6 in above comment)
> Showing only the "expresssions" part of execution plans
> Query-5:
> ========
> {code:java}
> ..
> ..
> Map Operator Tree:
>           TableScan
>             alias: a1
>             Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE 
> Column stats: NONE
>             Select Operator
>               expressions: a (type: decimal(8,2)), '' (type: string), null 
> (type: decimal(38,18))
>               outputColumnNames: _col0, _col1, _col2
> ..
> ..
> TableScan
>             alias: a2
>             Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE 
> Column stats: NONE
>             Select Operator
>               expressions: a (type: decimal(8,2)), UDFToString(b) (type: 
> string), CAST( c AS decimal(38,18)) (type: decimal(38,18))
> {code}
> Query-6:
> ========
> {code:java}
> ..
> ..
> Map Operator Tree:
>           TableScan
>             alias: a2
>             Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE 
> Column stats: NONE
>             Select Operator
>               expressions: a (type: decimal(8,2)), UDFToString(b) (type: 
> string), UDFToString(c) (type: string)
> ..
> ..
> TableScan
>             alias: a1
>             Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE 
> Column stats: NONE
>             Select Operator
>               expressions: a (type: decimal(8,2)), '' (type: string), '' 
> (type: string)
> ..
> ..
> {code}
> Attaching the execution plans for both queries for reference.
> 2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal.
> And by default, the precision and scale are taken as (38,18) in Hive. This 
> could be the reason for 18 zeros.
> 3. This is repeating every time implicit type casting is happening on EMPTY 
> strings.
> If excluding few columns in one of the SELECT statement is absolutely 
> necessary, then the only Workaround is to explicitly type cast the empty 
> strings to same Datatypes as the Other Select statement which included the 
> columns.
> For ex:
> Q1:
> select a,b,c from decisample3
> union all
> select a,cast(' ' as int),cast(' ' as decimal) from decisample;
> Q2:
> select a,cast(' ' as int),cast(' ' as decimal) from decisample
> union all
> select a,b,c from decisample3;
> Both the above queries will give consistent result now.
> cast(' ' as int) ---> this was cast to INT, same as datatype of "b"
> cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (HIVE-17355) Casting to Decimal along with UNION ALL gives incosistent results

Reply via email to