[ https://issues.apache.org/jira/browse/HIVE-22759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Harsh Maheshwari updated HIVE-22759:
------------------------------------
Description:
Premise: External table on top of parquet files. The table declares decimal columns with a fixed scale, and the parquet data may contain values with a higher scale. By default, Hive rounds the values to the scale defined in the table schema and applies filters to these rounded values.

Steps to reproduce -

a. Create a new HDFS path and [upload the attached parquet files|https://drive.google.com/file/d/1sUg-ovWku9X4AmmwQND2hzg5yIKVTy9q/view?usp=sharing] into that folder.

b. Create an external table managing these parquet files:
{noformat}
create table lineitem_15_3(l_orderkey int, l_suppkey int, l_linenumber int, l_quantity decimal(15,3), l_extendedprice decimal(15,3), l_discount decimal(15,3), l_tax decimal(15,3), l_returnflag string, l_linestatus string, l_shipdate date, l_commitdate date, l_receiptdate date, l_shipinstruct string, l_shipmode string, l_comment string) stored as parquet location 'hdfs://<location>';
{noformat}

c.
Execute the following query in both vectorized and non-vectorized mode -
{noformat}
select * from lineitem_15_3 where l_extendedprice <= 918.031 and l_discount >= 0.0 and l_tax >= 0.001 and l_orderkey = 107695104;
{noformat}

----
Non-vectorized mode produces the correct result (1 row) -
{noformat}
hive> set hive.vectorized.execution.enabled = false;
hive> select * from lineitem_15_3 where l_extendedprice <= 918.031 and l_discount >= 0.0 and l_tax >= 0.001 and l_orderkey = 107695104;
Query ID = hive_20200122101124_3734cfab-6cbb-4562-97ee-71e15775babe
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1579669891475_0048, Tracking URL = <...>
Kill Command = /opt/hadoop/bin/mapred job -kill job_1579669891475_0048
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-01-22 10:11:30,661 Stage-1 map = 0%, reduce = 0%
2020-01-22 10:11:39,833 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 7.45 sec
2020-01-22 10:11:46,961 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 25.22 sec
MapReduce Total cumulative CPU time: 25 seconds 220 msec
Ended Job = job_1579669891475_0048
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2  Cumulative CPU: 25.22 sec  HDFS Read: 318375885 HDFS Write: 310 SUCCESS
Total MapReduce CPU Time Spent: 25 seconds 220 msec
OK
107695104  263020  6  1.000  918.026  0.006  0.001  N  O  1995-06-21  1995-08-08  1995-07-14  TAKE BACK RETURN  FOB  thely express reques
Time taken: 23.24 seconds, Fetched: 1 row(s)
{noformat}

----
Vectorized mode produces an incorrect result (no rows) -
{noformat}
hive> set hive.vectorized.execution.enabled = true;
hive> select * from lineitem_15_3 where l_extendedprice <= 918.031 and l_discount >= 0.0 and l_tax >= 0.001 and l_orderkey = 107695104;
Query ID = hive_20200122101058_3cda8d56-0885-4c31-aff4-6686a84f639e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1579669891475_0047, Tracking URL = <...>
Kill Command = /opt/hadoop/bin/mapred job -kill job_1579669891475_0047
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-01-22 10:11:04,474 Stage-1 map = 0%, reduce = 0%
2020-01-22 10:11:12,651 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 5.29 sec
2020-01-22 10:11:16,735 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 20.6 sec
MapReduce Total cumulative CPU time: 20 seconds 600 msec
Ended Job = job_1579669891475_0047
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2  Cumulative CPU: 20.6 sec  HDFS Read: 318378365 HDFS Write: 174 SUCCESS
Total MapReduce CPU Time Spent: 20 seconds 600 msec
OK
Time taken: 19.822 seconds
{noformat}

Similar behaviour is observed for many other queries. The record in the parquet file has a value of 0.00075 for l_tax. Since the schema defines a scale of 3 decimal digits, the value should round to 0.001, match the filter, and the row should be returned.

If we explicitly apply the round() function with a scale of 3 to the l_tax column, the row shows up correctly.
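The rounding argument above can be sketched with Python's decimal module. This is an illustration only, not Hive's actual code; it assumes half-up rounding, which is what the non-vectorized path appears to apply, and uses the values from this report:

```python
# Illustrative sketch (assumption: the non-vectorized reader rounds half-up
# when coercing parquet values to the declared decimal(15,3) scale).
from decimal import Decimal, ROUND_HALF_UP

raw_l_tax = Decimal("0.00075")   # value stored in the parquet file
filter_value = Decimal("0.001")  # literal in the predicate l_tax >= 0.001

# Coerce to 3 decimal digits, as the table's decimal(15,3) scale requires.
scaled = raw_l_tax.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

print(scaled)                    # 0.001
print(scaled >= filter_value)    # True -> the row should be returned
```

Under this rounding the predicate matches, consistent with the non-vectorized result; the vectorized path evidently compares the values differently.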
{noformat}
hive> select * from lineitem_15_3 where l_extendedprice <= 918.031 and l_discount >= 0.0 and round(l_tax,3) >= 0.001 and l_orderkey = 107695104;
Query ID = hive_20200122103950_c5d579b8-770f-441d-bfaa-74c84ac3c515
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1579669891475_0049, Tracking URL = <...>
Kill Command = /opt/hadoop/bin/mapred job -kill job_1579669891475_0049
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-01-22 10:39:56,751 Stage-1 map = 0%, reduce = 0%
2020-01-22 10:40:03,894 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 8.63 sec
2020-01-22 10:40:12,044 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 27.01 sec
MapReduce Total cumulative CPU time: 27 seconds 10 msec
Ended Job = job_1579669891475_0049
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2  Cumulative CPU: 27.01 sec  HDFS Read: 318376023 HDFS Write: 310 SUCCESS
Total MapReduce CPU Time Spent: 27 seconds 10 msec
OK
107695104  263020  6  1.000  918.026  0.006  0.001  N  O  1995-06-21  1995-08-08  1995-07-14  TAKE BACK RETURN  FOB  thely express reques
Time taken: 22.888 seconds, Fetched: 1 row(s)
{noformat}

----
This indicates an issue with the default decimal rounding in vectorized mode.


> Default decimal round off producing incorrect results in "Vectorized" mode
> --------------------------------------------------------------------------
>
>                 Key: HIVE-22759
>                 URL: https://issues.apache.org/jira/browse/HIVE-22759
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 3.1.1
>         Environment: Hive on top of Hadoop, operating on parquet files. Issue reproducible with both MR and Tez execution.
>            Reporter: Harsh Maheshwari
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)