[ 
https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=832769&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-832769
 ]

ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Dec/22 14:52
            Start Date: 12/Dec/22 14:52
    Worklog Time Spent: 10m 
      Work Description: asolimando commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1045927909


##########
ql/src/test/results/clientpositive/beeline/colstats_all_nulls.q.out:
##########
@@ -73,6 +74,7 @@ max_col_len
 num_trues      
 num_falses     
 bit_vector     HL
+histogram      

Review Comment:
   We have two config knobs for enabling histogram/KLL: one is 
`hive.stats.kll.enable` (for turning it on/off during statistics computation), 
the other is `metastore.stats.fetch.kll` (which turns on/off the retrieval from 
the metastore).
   
   If we were to implement this, would you show/hide the **histogram** line for 
the `DESCRIBE FORMATTED` command based on the value of 
`metastore.stats.fetch.kll` alone?
   
   I thought of this at some point, but the fact that it wasn't done for 
HLL, even in the first PR, held me back from implementing it.
   
   The reason is that we would end up with behaviour inconsistent with 
HLL, which is always displayed, no matter what the configuration is for the 
corresponding `metastore.stats.fetch.bitvector` parameter.
   
   That said, I don't have strong feelings either way. I am fine with 
implementing the hiding mechanism, and it wouldn't take much; I just wanted to 
lay out my reasoning so far so that all the pros/cons are on the table.
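
   To make the trade-off concrete, here is a minimal sketch of the two 
display rules being compared. It is not Hive code: the class 
`SketchRowVisibility`, its methods, and the `false` fallback for an unset 
property are all assumptions made for illustration; only the two property 
names come from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Hive: decides whether DESCRIBE FORMATTED prints a sketch row. */
public class SketchRowVisibility {

    // Property names as quoted in the review comment.
    static final String FETCH_KLL = "metastore.stats.fetch.kll";
    static final String FETCH_BITVECTOR = "metastore.stats.fetch.bitvector";

    /** Reads a boolean property; the fallback for an unset key is an assumption. */
    public static boolean isEnabled(Map<String, String> conf, String key, boolean fallback) {
        String value = conf.get(key);
        return value == null ? fallback : Boolean.parseBoolean(value.trim());
    }

    /** Option under discussion: show the histogram row only when KLL fetching is on. */
    public static boolean showHistogramRow(Map<String, String> conf) {
        return isEnabled(conf, FETCH_KLL, false);
    }

    /** Behaviour described for HLL: the bit_vector row is shown regardless of the config. */
    public static boolean showBitVectorRow(Map<String, String> conf) {
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(FETCH_KLL, "false");
        conf.put(FETCH_BITVECTOR, "false");
        System.out.println("histogram row shown: " + showHistogramRow(conf));
        System.out.println("bit_vector row shown: " + showBitVectorRow(conf));
    }
}
```

   With both fetch properties set to `false`, the two rows would be treated 
differently, which is the inconsistency described above.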





Issue Time Tracking
-------------------

    Worklog Id:     (was: 832769)
    Time Spent: 10h 10m  (was: 10h)

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for 
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a 
> hard-coded value of 1/3 (see 
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column 
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
>  * efficiency: the approach must scale and support billions of rows
>  * merge-ability: partition-level histograms have to be merged to form 
> table-level histograms
>  * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates a [KLL data 
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. 
> Datasketches are small, stateful programs that process massive data-streams 
> and can provide approximate answers, with mathematical guarantees, to 
> computationally difficult queries orders-of-magnitude faster than 
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution 
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric 
> families) and temporal data types (date and timestamp).
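
As a rough illustration of why a CDF helps with range predicates, the sketch 
below estimates the selectivity of `low < col <= high` as 
`CDF(high) - CDF(low)` instead of a fixed 1/3. It is an assumption-laden toy: 
an exact empirical CDF over a small in-memory array stands in for the 
approximate CDF a KLL sketch would provide, and the class and method names are 
hypothetical, not Hive or DataSketches APIs.

```java
import java.util.Arrays;

/** Illustrative only: an exact empirical CDF stands in for a KLL sketch's approximate CDF. */
public class RangeSelectivitySketch {

    private final double[] sorted;

    public RangeSelectivitySketch(double[] values) {
        this.sorted = values.clone();
        Arrays.sort(this.sorted);
    }

    /** Fraction of values that are <= x, found by binary search over the sorted copy. */
    public double cdf(double x) {
        if (sorted.length == 0) {
            return 0.0;
        }
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= x) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return (double) lo / sorted.length;
    }

    /** Estimated selectivity of the predicate low < col <= high. */
    public double rangeSelectivity(double low, double high) {
        if (high <= low) {
            return 0.0;
        }
        return cdf(high) - cdf(low);
    }

    public static void main(String[] args) {
        // Skewed column: most values are small, a few are very large.
        double[] column = {1, 1, 1, 2, 2, 3, 5, 8, 100, 1000};
        RangeSelectivitySketch stats = new RangeSelectivitySketch(column);
        System.out.println("0 < col <= 3     : " + stats.rangeSelectivity(0, 3));
        System.out.println("50 < col <= 2000 : " + stats.rangeSelectivity(50, 2000));
    }
}
```

On this skewed sample the two ranges get very different estimates, whereas a 
hard-coded constant would assign both the same selectivity.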



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
