[ https://issues.apache.org/jira/browse/HIVE-22759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Harsh Maheshwari updated HIVE-22759:
------------------------------------
Description:
Premise: External table on top of parquet files. The table declares decimal columns with a fixed scale, and the parquet data may contain values with a higher scale. By default, Hive rounds the values to the scale defined in the table schema and applies filters to these rounded values.

Steps to reproduce -

a. Create a new HDFS path and [upload the attached parquet files|https://drive.google.com/file/d/1sUg-ovWku9X4AmmwQND2hzg5yIKVTy9q/view?usp=sharing] into that folder.

b. Create an external table managing these parquet files:
{noformat}
create table lineitem_15_3(l_orderkey int, l_suppkey int, l_linenumber int, l_quantity decimal(15,3), l_extendedprice decimal(15,3), l_discount decimal(15,3), l_tax decimal(15,3), l_returnflag string, l_linestatus string, l_shipdate date, l_commitdate date, l_receiptdate date, l_shipinstruct string, l_shipmode string, l_comment string) stored as parquet location 'hdfs://<location>';
{noformat}

c.
Execute the following query in both vectorized and non-vectorized mode -
{noformat}
select * from lineitem_15_3 where l_extendedprice <= 918.031 and l_discount >= 0.0 and l_tax >= 0.001 and l_orderkey = 107695104;
{noformat}

----
Non-vectorized mode produces the correct result (1 row) -
{noformat}
hive> set hive.vectorized.execution.enabled = false;
hive> select * from lineitem_15_3 where l_extendedprice <= 918.031 and l_discount >= 0.0 and l_tax >= 0.001 and l_orderkey = 107695104;
Query ID = hive_20200122101124_3734cfab-6cbb-4562-97ee-71e15775babe
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1579669891475_0048, Tracking URL = <...>
Kill Command = /opt/hadoop/bin/mapred job -kill job_1579669891475_0048
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-01-22 10:11:30,661 Stage-1 map = 0%, reduce = 0%
2020-01-22 10:11:39,833 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 7.45 sec
2020-01-22 10:11:46,961 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 25.22 sec
MapReduce Total cumulative CPU time: 25 seconds 220 msec
Ended Job = job_1579669891475_0048
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2  Cumulative CPU: 25.22 sec  HDFS Read: 318375885 HDFS Write: 310 SUCCESS
Total MapReduce CPU Time Spent: 25 seconds 220 msec
OK
107695104  263020  6  1.000  918.026  0.006  0.001  N  O  1995-06-21  1995-08-08  1995-07-14  TAKE BACK RETURN  FOB  thely express reques
Time taken: 23.24 seconds, Fetched: 1 row(s)
{noformat}

----
Vectorized mode produces an incorrect result (no rows) -
{noformat}
hive> set hive.vectorized.execution.enabled = true;
hive> select * from lineitem_15_3 where l_extendedprice <= 918.031 and l_discount >= 0.0 and l_tax >= 0.001 and l_orderkey = 107695104;
Query ID = hive_20200122101058_3cda8d56-0885-4c31-aff4-6686a84f639e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1579669891475_0047, Tracking URL = <...>
Kill Command = /opt/hadoop/bin/mapred job -kill job_1579669891475_0047
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-01-22 10:11:04,474 Stage-1 map = 0%, reduce = 0%
2020-01-22 10:11:12,651 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 5.29 sec
2020-01-22 10:11:16,735 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 20.6 sec
MapReduce Total cumulative CPU time: 20 seconds 600 msec
Ended Job = job_1579669891475_0047
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2  Cumulative CPU: 20.6 sec  HDFS Read: 318378365 HDFS Write: 174 SUCCESS
Total MapReduce CPU Time Spent: 20 seconds 600 msec
OK
Time taken: 19.822 seconds
{noformat}

Similar behaviour is observed for many other queries. The record in the parquet file has a value of 0.00075 for l_tax. Since the schema defines a scale of 3 decimal digits, the value should round to 0.001, match the filter, and the row should be returned.

If we explicitly apply the round() function with a scale of 3 to the l_tax column, the row shows up correctly.
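The rounding argument above can be sketched with Python's decimal module. This is an illustration only, not Hive's actual code; it assumes half-up rounding, which is what the non-vectorized path appears to apply, and uses the values from this report:

```python
# Illustrative sketch (assumption: the non-vectorized reader rounds half-up
# when coercing parquet values to the declared decimal(15,3) scale).
from decimal import Decimal, ROUND_HALF_UP

raw_l_tax = Decimal("0.00075")   # value stored in the parquet file
filter_value = Decimal("0.001")  # literal in the predicate l_tax >= 0.001

# Coerce to 3 decimal digits, as the table's decimal(15,3) scale requires.
scaled = raw_l_tax.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

print(scaled)                    # 0.001
print(scaled >= filter_value)    # True -> the row should be returned
```

Under this rounding the predicate matches, consistent with the non-vectorized result; the vectorized path evidently compares the values differently.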
{noformat}
hive> select * from lineitem_15_3 where l_extendedprice <= 918.031 and l_discount >= 0.0 and round(l_tax,3) >= 0.001 and l_orderkey = 107695104;
Query ID = hive_20200122103950_c5d579b8-770f-441d-bfaa-74c84ac3c515
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1579669891475_0049, Tracking URL = <...>
Kill Command = /opt/hadoop/bin/mapred job -kill job_1579669891475_0049
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-01-22 10:39:56,751 Stage-1 map = 0%, reduce = 0%
2020-01-22 10:40:03,894 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 8.63 sec
2020-01-22 10:40:12,044 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 27.01 sec
MapReduce Total cumulative CPU time: 27 seconds 10 msec
Ended Job = job_1579669891475_0049
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2  Cumulative CPU: 27.01 sec  HDFS Read: 318376023 HDFS Write: 310 SUCCESS
Total MapReduce CPU Time Spent: 27 seconds 10 msec
OK
107695104  263020  6  1.000  918.026  0.006  0.001  N  O  1995-06-21  1995-08-08  1995-07-14  TAKE BACK RETURN  FOB  thely express reques
Time taken: 22.888 seconds, Fetched: 1 row(s)
{noformat}

----
This indicates an issue with the default decimal rounding in vectorized mode.


> Default decimal round off producing incorrect results in "Vectorized" mode
> --------------------------------------------------------------------------
>
>                 Key: HIVE-22759
>                 URL: https://issues.apache.org/jira/browse/HIVE-22759
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 3.1.1
>         Environment: Hive on top of Hadoop, operating on parquet files. Issue reproducible with both MR and Tez execution.
>            Reporter: Harsh Maheshwari
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)