[ https://issues.apache.org/jira/browse/DATASKETCHES-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lee Rhodes updated DATASKETCHES-8:
----------------------------------
    Priority: Trivial  (was: Major)

> HLL doesn't take empty strings as distinct values
> -------------------------------------------------
>
>                 Key: DATASKETCHES-8
>                 URL: https://issues.apache.org/jira/browse/DATASKETCHES-8
>             Project: Apache Datasketches
>          Issue Type: Wish
>            Reporter: Adam Tamas
>            Assignee: Lee Rhodes
>            Priority: Trivial
>
> When using ds_hll, Hive does not count empty strings as distinct values for
> string and varchar columns.
> Example:
> Given a table t with the following (string, char(1), varchar(1)) values:
> {code:java}
> +------+------+------+
> | t.s  | t.c  | t.v  |
> +------+------+------+
> |      |      |      |
> | a    | a    | a    |
> |      |      |      |
> | a    | a    | a    |
> | s    | s    | s    |
> | d    | d    | d    |
> +------+------+------+
> {code}
> select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)),
> ds_hll_estimate(ds_hll_sketch(v)) from t;
> {code:java}
> +--------------------+--------------------+--------------------+
> |        _c0         |        _c1         |        _c2         |
> +--------------------+--------------------+--------------------+
> | 3.000000014901161  | 4.000000029802323  | 3.000000014901161  |
> +--------------------+--------------------+--------------------+
> {code}
> The problem could be here:
> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
> The char column works because its values are padded with spaces up to the
> declared length, so the empty value is never an empty string.
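
The reported estimates (about 3, 4 and 3) are what one would expect if the sketch's string update silently skips empty strings. The following is a minimal sketch of that reasoning, not the library's code: it uses an exact HashSet as a stand-in for the HLL sketch, and the class name, the countDistinct and padChar helpers, and the skipEmpty flag are all hypothetical names introduced for illustration. The skip-on-empty guard is an assumption based on the line the reporter points at in BaseHllSketch.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EmptyStringDistinctDemo {

    // Exact distinct count used as a stand-in for an HLL estimate.
    // skipEmpty models the ASSUMED guard that ignores null or empty strings.
    static int countDistinct(List<String> values, boolean skipEmpty) {
        Set<String> seen = new HashSet<>();
        for (String v : values) {
            if (v == null) {
                continue; // nulls are never counted in this model
            }
            if (skipEmpty && v.isEmpty()) {
                continue; // the empty string is dropped before it is counted
            }
            seen.add(v);
        }
        return seen.size();
    }

    // Models CHAR(n) semantics: values are right-padded with spaces to length n.
    static String padChar(String v, int n) {
        StringBuilder sb = new StringBuilder(v);
        while (sb.length() < n) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // Applies padChar to every value of a column.
    static List<String> asCharColumn(List<String> values, int n) {
        List<String> out = new ArrayList<>();
        for (String v : values) {
            out.add(padChar(v, n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Same six rows as the table t in the report.
        List<String> s = Arrays.asList("", "a", "", "a", "s", "d");
        List<String> c = asCharColumn(s, 1); // char(1): "" becomes " "
        List<String> v = s;                  // varchar(1): no padding

        System.out.println("string  : " + countDistinct(s, true));
        System.out.println("char(1) : " + countDistinct(c, true));
        System.out.println("varchar : " + countDistinct(v, true));
        System.out.println("string, empty counted: " + countDistinct(s, false));
    }
}
```

In this model the string and varchar columns lose the empty value and report 3 distinct values, while char(1) reports 4 because the padded value " " is not empty. Counting the empty string (skipEmpty set to false) gives 4 for the string column as well, which is the behaviour the reporter asks for.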