[ https://issues.apache.org/jira/browse/HIVE-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053607#comment-17053607 ]
David Mollitor edited comment on HIVE-22993 at 3/6/20, 5:24 PM:
----------------------------------------------------------------

[~gopalv] Thanks. Do you know what JIRA introduced this change? I have been testing on HDP 3.1.

Edit: Can this BIT_VECTOR field be applied to this request for better stats on INSERT?


was (Author: belugabehr):
[~gopalv] Thanks. Do you know what JIRA introduced this change? I have been testing on HDP 3.1.

> Include Bloom Filter in Column Statistics to Better Estimate nDV
> ----------------------------------------------------------------
>
>                 Key: HIVE-22993
>                 URL: https://issues.apache.org/jira/browse/HIVE-22993
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Statistics
>            Reporter: David Mollitor
>            Priority: Major
>
> When performing an INSERT statement, Hive has no way to determine the number
> of distinct values, since the distinct values themselves are not recorded.
> {code:sql}
> create table test_mm(`id` int, `my_dt` date);
> insert into test_mm values
>   (1, "2018-10-01"), (2, "2018-10-01"), (3, "2018-10-01"),
>   (4, "2017-10-01"), (5, "2017-10-01"), (6, "2017-10-01"),
>   (7, "2010-10-01"), (8, "2010-10-01"), (9, "2010-10-01"),
>   (10, "1998-10-01"), (11, "1998-10-01"), (12, "1998-10-01");
> DESCRIBE FORMATTED test_mm my_dt;
> -- distinct_count: 4
> insert into test_mm values
>   (13, "2030-10-01"), (14, "2030-10-01"), (15, "2030-10-01");
> DESCRIBE FORMATTED test_mm my_dt;
> -- distinct_count: 4
> {code}
> The first INSERT statement sees that the table holds 0 records, so it makes
> sense that every distinct value it inserts is counted in the statistics.
> However, for the second INSERT, Hive has no idea whether "2030-10-01" is a
> new distinct value, so the distinct_count is left unchanged. By introducing
> a bloom filter for column statistics, the second INSERT may be able to
> determine that "2030-10-01" has not been seen before and update the
> distinct_count accordingly.
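To make the proposal in the description concrete, here is a rough sketch of the idea in Python. It is not Hive code and does not model the BIT_VECTOR field mentioned in the comment. The names BloomFilter, ColumnStats, record_insert, num_bits and num_hashes are all made up for illustration, and the filter size is an arbitrary assumption. The sketch keeps a bloom filter next to the distinct count and bumps the count only when the filter reports that a value has definitely not been seen.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; num_bits and num_hashes are illustrative defaults."""

    def __init__(self, num_bits=65536, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def might_contain(self, value):
        # False means "definitely never added"; True means "possibly added".
        return all((self.bits >> p) & 1 for p in self._positions(value))

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p


class ColumnStats:
    """Hypothetical per-column statistics: an nDV estimate plus a Bloom filter."""

    def __init__(self):
        self.distinct_count = 0
        self.bloom = BloomFilter()

    def record_insert(self, values):
        for v in values:
            # A negative answer is always correct, so counting v as new is
            # safe. A positive answer may be a false positive, so this
            # estimate can undercount but never overcount.
            if not self.bloom.might_contain(v):
                self.distinct_count += 1
                self.bloom.add(v)


stats = ColumnStats()

# First INSERT from the description: four distinct dates, three rows each.
stats.record_insert(["2018-10-01"] * 3 + ["2017-10-01"] * 3
                    + ["2010-10-01"] * 3 + ["1998-10-01"] * 3)
first = stats.distinct_count

# Second INSERT: one date that is not yet in the table.
stats.record_insert(["2030-10-01"] * 3)
second = stats.distinct_count

print(first, second)
```

With a filter this sparse, the two printed counts should almost always be 4 and then 5. A false positive would lower either number, which is the usual trade-off of a bloom filter: the estimate is a lower bound on the true nDV, and its accuracy degrades as the filter fills up.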