David Mollitor created HIVE-22993:
-------------------------------------

             Summary: Include Bloom Filter in Column Statistics to Better Estimate nDV
                 Key: HIVE-22993
                 URL: https://issues.apache.org/jira/browse/HIVE-22993
             Project: Hive
          Issue Type: Improvement
          Components: CBO, Statistics
            Reporter: David Mollitor


When an INSERT statement runs against a table that already contains data, Hive 
has no way to determine the new number of distinct values, since the distinct 
values themselves are not recorded.

{code:sql}
create table test_mm(`id` int, `my_dt` date);

insert into test_mm values (1, "2018-10-01"), (2, "2018-10-01"), (3, "2018-10-01"),
(4, "2017-10-01"), (5, "2017-10-01"), (6, "2017-10-01"),
(7, "2010-10-01"), (8, "2010-10-01"), (9, "2010-10-01"),
(10, "1998-10-01"), (11, "1998-10-01"), (12, "1998-10-01");

DESCRIBE FORMATTED test_mm my_dt;
-- distinct_count: 4

insert into test_mm values (13, "2030-10-01"), (14, "2030-10-01"), (15, "2030-10-01");

DESCRIBE FORMATTED test_mm my_dt;
-- distinct_count: 4
{code}

The first INSERT statement sees that the table has 0 records, so every 
distinct value in the inserted data can safely be counted in the statistics.  
For the second INSERT, however, Hive has no idea whether "2030-10-01" is 
already present in the column, so the distinct_count is left unchanged.  By 
introducing a bloom filter for column statistics, the second INSERT may be able 
to determine that "2030-10-01" is indeed a new value and update the 
distinct_count accordingly.
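
To make the idea concrete, here is a rough sketch in Python of how a bloom 
filter kept alongside the column statistics could drive the distinct_count 
update. This is not Hive code: the BloomFilter and ColumnStats classes, their 
method names, and the filter sizing are all hypothetical and exist only for 
illustration. A bloom filter never reports a stored value as absent, so any 
value it reports as absent is definitely new and can be counted.

```python
import hashlib


class BloomFilter:
    """Hypothetical minimal bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def might_contain(self, value):
        # False means "definitely never added"; True means "possibly added".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos


class ColumnStats:
    """Hypothetical per-column statistics holding a count plus a bloom filter."""

    def __init__(self):
        self.distinct_count = 0
        self.bloom = BloomFilter()

    def record_insert(self, values):
        for value in values:
            # Only count a value when the filter proves it has not been seen.
            if not self.bloom.might_contain(value):
                self.distinct_count += 1
                self.bloom.add(value)


# Replay the two INSERT statements from the example above for column my_dt.
stats = ColumnStats()
stats.record_insert(
    ["2018-10-01"] * 3 + ["2017-10-01"] * 3 + ["2010-10-01"] * 3 + ["1998-10-01"] * 3
)
after_first = stats.distinct_count

stats.record_insert(["2030-10-01"] * 3)
after_second = stats.distinct_count

print(after_first, after_second)
```

With this sketch the second INSERT raises the count instead of leaving it 
unchanged. A false positive can only make a new value look already seen, so 
the count maintained this way errs on the low side and is a lower bound rather 
than an exact nDV.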



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
