[jira] [Updated] (HIVE-28363) Improve heuristics of FilterStatsRule without column stats

Shohei Okumiya (Jira) Mon, 08 Jul 2024 01:22:06 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-28363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shohei Okumiya updated HIVE-28363:
----------------------------------
    Description: 
HIVE-13287 gave a better estimation on the selectivity of IN operators. This 
ticket would try

 

This is an example. The table has ten rows and no column stats on `id`.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> DESCRIBE FORMATTED users id;
...
+------------------------+-----------------------------+
|    column_property     |            value            |
+------------------------+-----------------------------+
| col_name               | id                          |
| data_type              | int                         |
| min                    |                             |
| max                    |                             |
| num_nulls              |                             |
| distinct_count         |                             |
| avg_col_len            |                             |
| max_col_len            |                             |
| num_trues              |                             |
| num_falses             |                             |
| bit_vector             |                             |
| comment                | from deserializer           |
| COLUMN_STATS_ACCURATE  | {\"BASIC_STATS\":\"true\"}  |
+------------------------+-----------------------------+{code}
With a single needle, the estimated number becomes 10 * 0.5 = 5 because of the 
fallback heuristics.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
WHERE id IN (1);
...
|                 TableScan                          |
|                   alias: users                     |
|                   filterExpr: (id = 1) (type: boolean) |
|                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE |
|                   Filter Operator                  |
|                     predicate: (id = 1) (type: boolean) |
|                     Statistics: Num rows: 5 Data size: 5 Basic stats: 
COMPLETE Column stats: NONE | {code}
The size is estimated to be the original size with two or more needles. The 
heuristics estimate the size as min(10, 10 * 0.5 * N) = 10. However, I believe 
users expect to observe some reduction when they use IN.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
WHERE id IN (1, 2);
|                 TableScan                          |
|                   alias: users                     |
|                   filterExpr: (id) IN (1, 2) (type: boolean) |
|                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE |
|                   Filter Operator                  |
|                     predicate: (id) IN (1, 2) (type: boolean) |
|                     Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE | {code}
 

  was:
HIVE-13287 enables Hive to estimate the selectivity of IN operators better. It 
also changed the heuristics when column stats were unavailable. In my 
observation, the original one would give a better estimation in that case.

 

This is an example. The table has ten rows and no column stats on `id`.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> DESCRIBE FORMATTED users id;
...
+------------------------+-----------------------------+
|    column_property     |            value            |
+------------------------+-----------------------------+
| col_name               | id                          |
| data_type              | int                         |
| min                    |                             |
| max                    |                             |
| num_nulls              |                             |
| distinct_count         |                             |
| avg_col_len            |                             |
| max_col_len            |                             |
| num_trues              |                             |
| num_falses             |                             |
| bit_vector             |                             |
| comment                | from deserializer           |
| COLUMN_STATS_ACCURATE  | {\"BASIC_STATS\":\"true\"}  |
+------------------------+-----------------------------+{code}
With a single needle, the estimated number becomes 10 * 0.5 = 5 because of the 
fallback heuristics.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
WHERE id IN (1);
...
|                 TableScan                          |
|                   alias: users                     |
|                   filterExpr: (id = 1) (type: boolean) |
|                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE |
|                   Filter Operator                  |
|                     predicate: (id = 1) (type: boolean) |
|                     Statistics: Num rows: 5 Data size: 5 Basic stats: 
COMPLETE Column stats: NONE | {code}
The size is estimated to be the original size with two or more needles. The 
heuristics estimate the size as min(10, 10 * 0.5 * N) = 10. However, I believe 
users expect to observe some reduction when they use IN.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
WHERE id IN (1, 2);
|                 TableScan                          |
|                   alias: users                     |
|                   filterExpr: (id) IN (1, 2) (type: boolean) |
|                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE |
|                   Filter Operator                  |
|                     predicate: (id) IN (1, 2) (type: boolean) |
|                     Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE | {code}
 


> Improve heuristics of FilterStatsRule without column stats
> ----------------------------------------------------------
>
>                 Key: HIVE-28363
>                 URL: https://issues.apache.org/jira/browse/HIVE-28363
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>    Affects Versions: 4.0.0
>            Reporter: Shohei Okumiya
>            Assignee: Shohei Okumiya
>            Priority: Major
>
> HIVE-13287 gave a better estimation on the selectivity of IN operators. This 
> ticket would try
>  
> This is an example. The table has ten rows and no column stats on `id`.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:10000/defaul> DESCRIBE FORMATTED users id;
> ...
> +------------------------+-----------------------------+
> |    column_property     |            value            |
> +------------------------+-----------------------------+
> | col_name               | id                          |
> | data_type              | int                         |
> | min                    |                             |
> | max                    |                             |
> | num_nulls              |                             |
> | distinct_count         |                             |
> | avg_col_len            |                             |
> | max_col_len            |                             |
> | num_trues              |                             |
> | num_falses             |                             |
> | bit_vector             |                             |
> | comment                | from deserializer           |
> | COLUMN_STATS_ACCURATE  | {\"BASIC_STATS\":\"true\"}  |
> +------------------------+-----------------------------+{code}
> With a single needle, the estimated number becomes 10 * 0.5 = 5 because of 
> the fallback heuristics.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
> WHERE id IN (1);
> ...
> |                 TableScan                          |
> |                   alias: users                     |
> |                   filterExpr: (id = 1) (type: boolean) |
> |                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
> COMPLETE Column stats: NONE |
> |                   Filter Operator                  |
> |                     predicate: (id = 1) (type: boolean) |
> |                     Statistics: Num rows: 5 Data size: 5 Basic stats: 
> COMPLETE Column stats: NONE | {code}
> The size is estimated to be the original size with two or more needles. The 
> heuristics estimate the size as min(10, 10 * 0.5 * N) = 10. However, I 
> believe users expect to observe some reduction when they use IN.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
> WHERE id IN (1, 2);
> |                 TableScan                          |
> |                   alias: users                     |
> |                   filterExpr: (id) IN (1, 2) (type: boolean) |
> |                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
> COMPLETE Column stats: NONE |
> |                   Filter Operator                  |
> |                     predicate: (id) IN (1, 2) (type: boolean) |
> |                     Statistics: Num rows: 10 Data size: 11 Basic stats: 
> COMPLETE Column stats: NONE | {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HIVE-28363) Improve heuristics of FilterStatsRule without column stats

Reply via email to