[ 
https://issues.apache.org/jira/browse/HIVE-28363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28363:
----------------------------------
    Labels: pull-request-available  (was: )

> Improve heuristics of FilterStatsRule without column stats
> ----------------------------------------------------------
>
>                 Key: HIVE-28363
>                 URL: https://issues.apache.org/jira/browse/HIVE-28363
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>    Affects Versions: 4.0.0
>            Reporter: Shohei Okumiya
>            Assignee: Shohei Okumiya
>            Priority: Major
>              Labels: pull-request-available
>
> HIVE-13287 gave a better estimation of the selectivity of IN operators, 
> especially when column stats are available. This ticket would try to improve 
> the case where column stats are unavailable.
>  
> This is an example. The table has ten rows and no column stats on `id`.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:10000/defaul> DESCRIBE FORMATTED users id;
> ...
> +------------------------+-----------------------------+
> |    column_property     |            value            |
> +------------------------+-----------------------------+
> | col_name               | id                          |
> | data_type              | int                         |
> | min                    |                             |
> | max                    |                             |
> | num_nulls              |                             |
> | distinct_count         |                             |
> | avg_col_len            |                             |
> | max_col_len            |                             |
> | num_trues              |                             |
> | num_falses             |                             |
> | bit_vector             |                             |
> | comment                | from deserializer           |
> | COLUMN_STATS_ACCURATE  | {\"BASIC_STATS\":\"true\"}  |
> +------------------------+-----------------------------+{code}
> With a single needle, the estimated number becomes 10 * 0.5 = 5 because of 
> the fallback heuristics.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
> WHERE id IN (1);
> ...
> |                 TableScan                          |
> |                   alias: users                     |
> |                   filterExpr: (id = 1) (type: boolean) |
> |                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
> COMPLETE Column stats: NONE |
> |                   Filter Operator                  |
> |                     predicate: (id = 1) (type: boolean) |
> |                     Statistics: Num rows: 5 Data size: 5 Basic stats: 
> COMPLETE Column stats: NONE | {code}
> The size is estimated to be the original size with two or more needles. The 
> heuristics estimate the size as min(10, 10 * 0.5 * N) = 10. However, I 
> believe users expect to observe some reduction when using IN.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
> WHERE id IN (1, 2);
> |                 TableScan                          |
> |                   alias: users                     |
> |                   filterExpr: (id) IN (1, 2) (type: boolean) |
> |                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
> COMPLETE Column stats: NONE |
> |                   Filter Operator                  |
> |                     predicate: (id) IN (1, 2) (type: boolean) |
> |                     Statistics: Num rows: 10 Data size: 11 Basic stats: 
> COMPLETE Column stats: NONE | {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to