[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics

ASF GitHub Bot (Jira) Thu, 08 Dec 2022 06:19:04 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=832089&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-832089
 ]


ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Dec/22 14:18
            Start Date: 08/Dec/22 14:18
    Worklog Time Spent: 10m 
      Work Description: asolimando commented on PR #3137:
URL: https://github.com/apache/hive/pull/3137#issuecomment-1342807270

   > Another question I don't see here is how we generate the histogram 
statistics? by issuing an "analyze table" command?
   
   That was hard to figure out for me too at first. Statistics computation 
happens via an aggregate query, where different `UDAF`s are used to compute the 
different statistics.
   
   
[ColumnStatsSemanticAnalyzer.java#L308-L325](https://github.com/apache/hive/blob/1e9e51dbb5ab5acd4d5a05eff31752a5997beb03/ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java#L308-L325)
 generates the `SELECT` statement for the stats.
   
   It then calls 
[ColumnStatsSemanticAnalyzer.java#L327](https://github.com/apache/hive/blob/1e9e51dbb5ab5acd4d5a05eff31752a5997beb03/ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java#L327),
 which uses an enum listing the different statistics. We added a new 
entry for histograms and generated the code accordingly (see 
[ColumnStatsSemanticAnalyzer.java#L355-L357](https://github.com/apache/hive/blob/1e9e51dbb5ab5acd4d5a05eff31752a5997beb03/ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java#L355-L357)).
   
   Finally, the UDAF part is generated here: 
[ColumnStatsSemanticAnalyzer.java#L494-L519](https://github.com/apache/hive/blob/1e9e51dbb5ab5acd4d5a05eff31752a5997beb03/ql/src/java/org/apache/hadoop/hive/ql/parse/ColumnStatsSemanticAnalyzer.java#L494-L519).
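
   As a rough illustration of the flow described above (not the actual Hive code), the sketch below builds a single aggregate `SELECT` from an enum with one entry per statistic. The class, enum, and method names are hypothetical, and the aggregate expressions, including the `ds_kll_sketch` UDAF name used for the histogram entry, are assumptions made for the example rather than what the PR generates.

```java
import java.util.ArrayList;
import java.util.List;

public class StatsQuerySketch {

    /** Hypothetical enum of statistics; each entry renders one aggregate expression. */
    enum StatKind {
        MIN("min(%s)"),
        MAX("max(%s)"),
        NUM_NULLS("count(1) - count(%s)"),
        // Assumed UDAF name for the KLL-based histogram; not taken from the PR.
        HISTOGRAM("ds_kll_sketch(cast(%s as float))");

        private final String template;

        StatKind(String template) {
            this.template = template;
        }

        /** Substitutes the column name into this statistic's expression template. */
        String render(String column) {
            return String.format(template, column);
        }
    }

    /** Builds one aggregate SELECT covering every statistic of every column. */
    static String buildStatsQuery(String table, List<String> columns) {
        List<String> exprs = new ArrayList<>();
        for (String column : columns) {
            for (StatKind kind : StatKind.values()) {
                exprs.add(kind.render(column));
            }
        }
        return "SELECT " + String.join(", ", exprs) + " FROM " + table;
    }

    public static void main(String[] args) {
        System.out.println(buildStatsQuery("t", List.of("a", "b")));
    }
}
```

   With this shape, supporting a new statistic amounts to adding one enum entry, which mirrors the approach of adding a histogram entry to the existing enum.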




Issue Time Tracking
-------------------

    Worklog Id:     (was: 832089)
    Time Spent: 6h 50m  (was: 6h 40m)

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for 
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a 
> hard-coded value of 1/3 (see 
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column 
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
>  * efficiency: the approach must scale and support billions of rows
>  * merge-ability: partition-level histograms have to be merged to form 
> table-level histograms
>  * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates UDAFs for [KLL data 
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html]. 
> Datasketches are small, stateful programs that process massive data-streams 
> and can provide approximate answers, with mathematical guarantees, to 
> computationally difficult queries orders-of-magnitude faster than 
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution 
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric 
> families) and temporal data types (date and timestamp).
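
To make the CDF idea concrete, here is a minimal sketch of how a range predicate's selectivity could be estimated from a cumulative distribution function instead of a fixed 1/3. It uses an exact empirical CDF over a sorted array as a stand-in for the approximate CDF a KLL sketch would provide. It does not use the DataSketches API, and all class and method names are hypothetical.

```java
import java.util.Arrays;

public class RangeSelectivitySketch {

    /**
     * Fraction of values <= x in a sorted array. This is an exact empirical
     * CDF, standing in for the approximate CDF a KLL sketch would return.
     */
    static double cdf(double[] sorted, double x) {
        if (sorted.length == 0) {
            return 0.0;
        }
        // Binary search for the number of elements <= x.
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= x) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return (double) lo / sorted.length;
    }

    /** Estimated selectivity of the predicate low < col <= high. */
    static double rangeSelectivity(double[] sorted, double low, double high) {
        if (high <= low) {
            return 0.0;
        }
        return cdf(sorted, high) - cdf(sorted, low);
    }

    public static void main(String[] args) {
        // Skewed data: most values are small, a few are very large.
        double[] values = {1, 1, 1, 1, 2, 2, 3, 50, 100, 1000};
        Arrays.sort(values);
        System.out.println(rangeSelectivity(values, 0, 3));
        System.out.println(rangeSelectivity(values, 3, 1000));
    }
}
```

On skewed data like the array in `main`, the two ranges get very different estimates, which a constant selectivity cannot capture.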



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
