[ https://issues.apache.org/jira/browse/DATASKETCHES-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Tamas updated DATASKETCHES-8:
----------------------------------
    Description:
When using ds_hll, Hive does not count empty strings as distinct values for string and varchar columns.

Example:
Given a table t with the following (string, char(1), varchar(1)) values:
{code:java}
+------+------+------+
| t.s  | t.c  | t.v  |
+------+------+------+
|      |      |      |
| a    | a    | a    |
|      |      |      |
| a    | a    | a    |
| s    | s    | s    |
| d    | d    | d    |
+------+------+------+
{code}
select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)), ds_hll_estimate(ds_hll_sketch(v)) from t;
{code:java}
+--------------------+--------------------+--------------------+
| _c0                | _c1                | _c2                |
+--------------------+--------------------+--------------------+
| 3.000000014901161  | 4.000000029802323  | 3.000000014901161  |
+--------------------+--------------------+--------------------+
{code}
The problem could be here:
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351

The char column works because its values are padded with spaces up to the declared length.

  was:
When using ds_hll, Hive does not count empty strings as distinct values for string and varchar columns.

Example:
Given a table t with the following (string, char(1), varchar(1)) values:
{code:java}
+------+------+------+
| t.s  | t.c  | t.v  |
+------+------+------+
|      |      |      |
| a    | a    | a    |
|      |      |      |
| a    | a    | a    |
| s    | s    | s    |
| d    | d    | d    |
+------+------+------+
{code}
select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)), ds_hll_estimate(ds_hll_sketch(v)) from t;
{code:java}
+--------------------+--------------------+--------------------+
| _c0                | _c1                | _c2                |
+--------------------+--------------------+--------------------+
| 3.000000014901161  | 4.000000029802323  | 3.000000014901161  |
+--------------------+--------------------+--------------------+
{code}
The problem could be here:
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
but for char it works fine.


> HLL doesn't take empty strings as distinct values
> -------------------------------------------------
>
>                 Key: DATASKETCHES-8
>                 URL: https://issues.apache.org/jira/browse/DATASKETCHES-8
>             Project: Apache Datasketches
>          Issue Type: Bug
>            Reporter: Adam Tamas
>            Priority: Major
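
The reported numbers (3, 4, 3) are what one would expect if the sketch's string update path silently skips empty input, which is what the report suspects but does not confirm. The sketch below only models that suspicion: it uses an exact HashSet in place of an HLL sketch, and the class and method names (EmptyStringGuardDemo, countDistinctSkippingEmpty, padToChar) are hypothetical, not part of DataSketches or Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EmptyStringGuardDemo {

    // Hypothetical stand-in for an update path that ignores null or empty
    // strings. An exact HashSet replaces the HLL estimator for clarity.
    static int countDistinctSkippingEmpty(List<String> values) {
        Set<String> seen = new HashSet<>();
        for (String v : values) {
            if (v == null || v.isEmpty()) {
                continue; // assumed guard: empty input never reaches the set
            }
            seen.add(v);
        }
        return seen.size();
    }

    // Hypothetical model of char(n) storage: right-pad with spaces to length n.
    static String padToChar(String v, int n) {
        StringBuilder sb = new StringBuilder(v);
        while (sb.length() < n) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same six rows as the table in the report.
        List<String> column = Arrays.asList("", "a", "", "a", "s", "d");

        // char(1) column: the empty value becomes a single space.
        List<String> charColumn = new ArrayList<>();
        for (String v : column) {
            charColumn.add(padToChar(v, 1));
        }

        System.out.println("string/varchar distinct: "
                + countDistinctSkippingEmpty(column));     // prints 3
        System.out.println("char(1) distinct: "
                + countDistinctSkippingEmpty(charColumn)); // prints 4
    }
}
```

Under these assumptions the empty value is dropped for the string and varchar columns, while the padded char(1) value " " is non-empty and is counted, which matches the estimates in the report.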