[ https://issues.apache.org/jira/browse/HIVE-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aditya Allamraju updated HIVE-17355: ------------------------------------ Description: Extra trailing zeros are added when running "union all" on the tables containing decimal data types. *Version:* Hive 2.1 *Steps to repro:-* {code:java} 1) CREATE TABLE `decisample`( `a` decimal(8,2), `b` int, `c` decimal(5,2)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'maprfs:/user/hive/warehouse/decisample' 2) CREATE TABLE `decisample3`( `a` decimal(8,2), `b` int, `c` decimal(5,2)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'maprfs:/user/hive/warehouse/decisample3' 3)hive> select * from decisample3; OK 1.00 2 3.00 7.00 8 9.00 4)hive> select * from decisample; OK 4.00 5 6.00 5) query:- select a1.a, '' as a1b,'' as a1c from decisample a1 union all select a2.a,a2.b,a2.c from decisample3 a2; o/p:- OK 4.00 NULL 1.00 2 3.000000000000000000 7.00 8 9.000000000000000000 Time taken: 87.993 seconds, Fetched: 3 row(s) 6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as a1b,'' as a1c from decisample a1; o/p:- 4.00 1.00 2 3 7.00 8 9 {code} Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no trailing zero. Observation: 1. Hive is trying to run the UNION ALL after ensuring the SELECT's are semantically same(equal number of columns and same datatypes). To do this, it is implicitly type casting the values where required. >From the explain plan, type casting is not consistent when done 2 different >ways: a) select-1 UNION ALL select-2 (Query-5 in above comment) vs b) select-2 UNION ALL select-2 (Query-6 in above comment) Showing only the "expresssions" part of execution plans Query-5: ======== {code:java} .. .. Map Operator Tree: TableScan alias: a1 Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: a (type: decimal(8,2)), '' (type: string), null (type: decimal(38,18)) outputColumnNames: _col0, _col1, _col2 .. .. TableScan alias: a2 Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), CAST( c AS decimal(38,18)) (type: decimal(38,18)) {code} Query-6: ======== {code:java} .. .. Map Operator Tree: TableScan alias: a2 Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), UDFToString(c) (type: string) .. .. TableScan alias: a1 Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: a (type: decimal(8,2)), '' (type: string), '' (type: string) .. .. {code} Attaching the execution plans for both queries for reference. 2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal. And by default, the precision and scale are taken as (38,18) in Hive. This could be the reason for 18 zeros. 3. This is repeating every time implicit type casting is happening on EMPTY strings. If excluding few columns in one of the SELECT statement is absolutely necessary, then the only Workaround is to explicitly type cast the empty strings to same Datatypes as the Other Select statement which included the columns. For ex: Q1: select a,b,c from decisample3 union all select a,cast(' ' as int),cast(' ' as decimal) from decisample; Q2: select a,cast(' ' as int),cast(' ' as decimal) from decisample union all select a,b,c from decisample3; Both the above queries will give consistent result now. cast(' ' as int) ---> this was cast to INT, same as datatype of "b" cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c" was: Extra trailing zeros are added when running "union all" on the tables containing decimal data types. Version: Hive 2.1 Steps to repro:- ================ 1) CREATE TABLE `decisample`( `a` decimal(8,2), `b` int, `c` decimal(5,2)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'maprfs:/user/hive/warehouse/decisample' 2) CREATE TABLE `decisample3`( `a` decimal(8,2), `b` int, `c` decimal(5,2)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'maprfs:/user/hive/warehouse/decisample3' 3)hive> select * from decisample3; OK 1.00 2 3.00 7.00 8 9.00 4)hive> select * from decisample; OK 4.00 5 6.00 5) query:- select a1.a, '' as a1b,'' as a1c from decisample a1 union all select a2.a,a2.b,a2.c from decisample3 a2; o/p:- OK 4.00 NULL 1.00 2 3.000000000000000000 7.00 8 9.000000000000000000 Time taken: 87.993 seconds, Fetched: 3 row(s) 6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as a1b,'' as a1c from decisample a1; o/p:- 4.00 1.00 2 3 7.00 8 9 Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no trailing zero. Observation: 1. Hive is trying to run the UNION ALL after ensuring the SELECT's are semantically same(equal number of columns and same datatypes). To do this, it is implicitly type casting the values where required. >From the explain plan, type casting is not consistent when done 2 different >ways: a) select-1 UNION ALL select-2 (Query-5 in above comment) vs b) select-2 UNION ALL select-2 (Query-6 in above comment) Showing only the "expresssions" part of execution plans Query-5: ======== .. .. Map Operator Tree: TableScan alias: a1 Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: a (type: decimal(8,2)), '' (type: string), null (type: decimal(38,18)) outputColumnNames: _col0, _col1, _col2 .. .. TableScan alias: a2 Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), CAST( c AS decimal(38,18)) (type: decimal(38,18)) Query-6: ======== .. .. Map Operator Tree: TableScan alias: a2 Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), UDFToString(c) (type: string) .. .. TableScan alias: a1 Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: a (type: decimal(8,2)), '' (type: string), '' (type: string) .. .. Attaching the execution plans for both queries for reference. 2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal. And by default, the precision and scale are taken as (38,18) in Hive. This could be the reason for 18 zeros. 3. This is repeating every time implicit type casting is happening on EMPTY strings. If excluding few columns in one of the SELECT statement is absolutely necessary, then the only Workaround is to explicitly type cast the empty strings to same Datatypes as the Other Select statement which included the columns. For ex: Q1: select a,b,c from decisample3 union all select a,cast(' ' as int),cast(' ' as decimal) from decisample; Q2: select a,cast(' ' as int),cast(' ' as decimal) from decisample union all select a,b,c from decisample3; Both the above queries will give consistent result now. cast(' ' as int) ---> this was cast to INT, same as datatype of "b" cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c" > Casting to Decimal along with UNION ALL gives incosistent results > ----------------------------------------------------------------- > > Key: HIVE-17355 > URL: https://issues.apache.org/jira/browse/HIVE-17355 > Project: Hive > Issue Type: Bug > Components: Parser, UDF > Affects Versions: 2.1.0, 2.1.1 > Environment: CentOS 7.2 > Reporter: Aditya Allamraju > Assignee: Aditya Allamraju > > Extra trailing zeros are added when running "union all" on the tables > containing decimal data types. > *Version:* Hive 2.1 > *Steps to repro:-* > {code:java} > 1) CREATE TABLE `decisample`( > `a` decimal(8,2), > `b` int, > `c` decimal(5,2)) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > LOCATION > 'maprfs:/user/hive/warehouse/decisample' > 2) CREATE TABLE `decisample3`( > `a` decimal(8,2), > `b` int, > `c` decimal(5,2)) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > LOCATION > 'maprfs:/user/hive/warehouse/decisample3' > 3)hive> select * from decisample3; > OK > 1.00 2 3.00 > 7.00 8 9.00 > 4)hive> select * from decisample; > OK > 4.00 5 6.00 > 5) query:- > select a1.a, '' as a1b,'' as a1c from decisample a1 union all select > a2.a,a2.b,a2.c from decisample3 a2; > o/p:- > OK > 4.00 NULL > 1.00 2 3.000000000000000000 > 7.00 8 9.000000000000000000 > Time taken: 87.993 seconds, Fetched: 3 row(s) > 6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as > a1b,'' as a1c from decisample a1; > o/p:- > 4.00 > 1.00 2 3 > 7.00 8 9 > {code} > Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no > trailing zero. > Observation: > 1. Hive is trying to run the UNION ALL after ensuring the SELECT's are > semantically same(equal number of columns and same datatypes). To do this, it > is implicitly type casting the values where required. > From the explain plan, type casting is not consistent when done 2 different > ways: > a) select-1 UNION ALL select-2 (Query-5 in above comment) > vs > b) select-2 UNION ALL select-2 (Query-6 in above comment) > Showing only the "expresssions" part of execution plans > Query-5: > ======== > {code:java} > .. > .. > Map Operator Tree: > TableScan > alias: a1 > Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: a (type: decimal(8,2)), '' (type: string), null > (type: decimal(38,18)) > outputColumnNames: _col0, _col1, _col2 > .. > .. > TableScan > alias: a2 > Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: a (type: decimal(8,2)), UDFToString(b) (type: > string), CAST( c AS decimal(38,18)) (type: decimal(38,18)) > {code} > Query-6: > ======== > {code:java} > .. > .. > Map Operator Tree: > TableScan > alias: a2 > Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: a (type: decimal(8,2)), UDFToString(b) (type: > string), UDFToString(c) (type: string) > .. > .. > TableScan > alias: a1 > Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: a (type: decimal(8,2)), '' (type: string), '' > (type: string) > .. > .. > {code} > Attaching the execution plans for both queries for reference. > 2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal. > And by default, the precision and scale are taken as (38,18) in Hive. This > could be the reason for 18 zeros. > 3. This is repeating every time implicit type casting is happening on EMPTY > strings. > If excluding few columns in one of the SELECT statement is absolutely > necessary, then the only Workaround is to explicitly type cast the empty > strings to same Datatypes as the Other Select statement which included the > columns. > For ex: > Q1: > select a,b,c from decisample3 > union all > select a,cast(' ' as int),cast(' ' as decimal) from decisample; > Q2: > select a,cast(' ' as int),cast(' ' as decimal) from decisample > union all > select a,b,c from decisample3; > Both the above queries will give consistent result now. > cast(' ' as int) ---> this was cast to INT, same as datatype of "b" > cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c" -- This message was sent by Atlassian JIRA (v6.4.14#64029)