[ 
https://issues.apache.org/jira/browse/DATASKETCHES-8?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161429#comment-17161429
 ] 

Lee Rhodes commented on DATASKETCHES-8:
---------------------------------------

This is not a problem and certainly not a bug.  It is by design.  

The way unique counting sketches work is by translating an input value into a 
hash value that acts as a proxy for the actual input value.  Simplistically, if 
that hash value is ever seen again it is discarded as a duplicate. 
 * An empty string or empty anything has no value(s) to hash. 
 * Even if it did, all empty values are equivalent.  So the total number of 
unique empty values is *one* no matter how big the stream is.  
 * Sketches are approximate algorithms by definition and designed to count very 
large streams of distinct items approximately, where an error of +/- one is 
just not a concern as it is considerably smaller than the guaranteed error of 
the sketch for large streams.
 * The fact that the DataSketches unique counting algorithms provide "exact" 
counts for very small streams is essentially a "gift" as not all unique 
counting sketch algorithms even provide that.
 * If all of the streams you are counting are small enough to be counted 
exactly by a sketch, then you don't need a sketch.  Just use a hash table. 
 * If you really want to add a count of "one" for items like nulls and 
empties, then I suggest detecting these values before submitting them to the 
sketch and substituting a proxy value of your choice that the sketch can hash.
 * Proxy values can only be chosen in an application context. It would be 
inappropriate for the sketches to choose some proxy value as it might collide 
with an actual value in another application context.
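
To illustrate the proxy suggestion above, here is a minimal sketch in plain 
Java. The class name EmptyValueProxy, the constants EMPTY_PROXY and NULL_PROXY, 
and the helper withProxy are hypothetical names chosen for this example, and 
the proxy strings are assumed not to occur in the real data. A HashSet stands 
in for the sketch so that the example runs without the library; in a real 
application you would pass the result of withProxy(...) to the sketch's update 
call (for example HllSketch.update(String)) instead.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EmptyValueProxy {
    // Hypothetical proxy values. They must be chosen per application so that
    // they cannot collide with a real value in the data.
    static final String EMPTY_PROXY = "__EMPTY__";
    static final String NULL_PROXY = "__NULL__";

    // Replace null and empty inputs with a hashable proxy value.
    static String withProxy(String value) {
        if (value == null) {
            return NULL_PROXY;
        }
        if (value.isEmpty()) {
            return EMPTY_PROXY;
        }
        return value;
    }

    // Stand-in for a unique counting sketch: an exact distinct count.
    // With useProxy == false, null and empty inputs are skipped, which
    // mimics the behavior described in this comment.
    static int countDistinct(List<String> values, boolean useProxy) {
        Set<String> seen = new HashSet<>();
        for (String v : values) {
            if (useProxy) {
                seen.add(withProxy(v));
            } else if (v != null && !v.isEmpty()) {
                seen.add(v);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Same values as the string column in the issue description.
        List<String> column = Arrays.asList("", "a", "", "a", "s", "d");
        System.out.println(countDistinct(column, false)); // prints 3
        System.out.println(countDistinct(column, true));  // prints 4
    }
}
```

Keeping the substitution in application code means each application decides 
for itself whether nulls and empties should count, and with which proxy.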

I will leave this issue open for one more day then close it.

> HLL doesn't take empty strings as distinct values
> -------------------------------------------------
>
>                 Key: DATASKETCHES-8
>                 URL: https://issues.apache.org/jira/browse/DATASKETCHES-8
>             Project: Apache Datasketches
>          Issue Type: Bug
>            Reporter: Adam Tamas
>            Priority: Major
>
> Using ds_hll, Hive does not count empty strings as distinct values for string 
> and varchar columns.
> Example:
> Given a table t with the following (string, char(1), varchar(1)) values:
> {code:java}
> +------+------+------+
> | t.s  | t.c  | t.v  |
> +------+------+------+
> |      |      |      |
> | a    | a    | a    |
> |      |      |      |
> | a    | a    | a    |
> | s    | s    | s    |
> | d    | d    | d    |
> +------+------+------+
> {code}
> {code:sql}
> select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)), 
> ds_hll_estimate(ds_hll_sketch(v)) from t;
> {code}
> {code:java}
> +--------------------+--------------------+--------------------+
> |        _c0         |        _c1         |        _c2         |
> +--------------------+--------------------+--------------------+
> | 3.000000014901161  | 4.000000029802323  | 3.000000014901161  |
> +--------------------+--------------------+--------------------+
> {code}
> The problem could be here: 
> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
> The char column works because its values are padded with spaces up to the 
> declared length.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

