[ 
https://issues.apache.org/jira/browse/DATASKETCHES-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Tamas updated DATASKETCHES-8:
----------------------------------
    Description: 
When using the ds_hll functions in Hive, empty strings are not counted as a 
distinct value for string and varchar columns.

Example:
Given a table t with the following (string, char(1), varchar(1)) values:

{code:java}
+------+------+------+
| t.s  | t.c  | t.v  |
+------+------+------+
|      |      |      |
| a    | a    | a    |
|      |      |      |
| a    | a    | a    |
| s    | s    | s    |
| d    | d    | d    |
+------+------+------+
{code}


select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)), 
ds_hll_estimate(ds_hll_sketch(v)) from t;


{code:java}
+--------------------+--------------------+--------------------+
|        _c0         |        _c1         |        _c2         |
+--------------------+--------------------+--------------------+
| 3.000000014901161  | 4.000000029802323  | 3.000000014901161  |
+--------------------+--------------------+--------------------+
{code}

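The expected estimate is about 4 for every column (the distinct values are the 
empty string, a, s and d), but string and varchar return about 3.

The problem could be here: 
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
although char columns are counted correctly.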
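
To illustrate the suspected cause, here is a minimal Java sketch. It does not use the DataSketches library: the class EmptyStringGuardDemo and its methods are hypothetical, and an exact HashSet stands in for the HLL sketch. It assumes that the update path skips null and empty strings, and that Hive pads an empty char(1) value to a single space, which would explain why only the char column counts 4.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EmptyStringGuardDemo {

    // Hypothetical stand-in for a sketch update loop: an exact set, not HLL.
    // Assumption: null AND empty strings are skipped before they are hashed.
    static int countWithGuard(List<String> values) {
        Set<String> seen = new HashSet<>();
        for (String v : values) {
            if (v == null || v.isEmpty()) {
                continue; // the empty string never reaches the set
            }
            seen.add(v);
        }
        return seen.size();
    }

    // Same loop, but only null is skipped, so "" counts as a distinct value.
    static int countWithoutGuard(List<String> values) {
        Set<String> seen = new HashSet<>();
        for (String v : values) {
            if (v == null) {
                continue;
            }
            seen.add(v);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Values of the string/varchar columns from the example table.
        List<String> stringColumn = Arrays.asList("", "a", "", "a", "s", "d");
        // Assumption: char(1) stores the empty value padded to one space.
        List<String> charColumn = Arrays.asList(" ", "a", " ", "a", "s", "d");

        System.out.println(countWithGuard(stringColumn));    // 3
        System.out.println(countWithGuard(charColumn));      // 4
        System.out.println(countWithoutGuard(stringColumn)); // 4
    }
}
```

If these assumptions hold, the guarded loop reproduces the 3 / 4 / 3 pattern reported above, while skipping only null values would give 4 for all three columns.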

> HLL doesn't take empty strings as distinct values
> -------------------------------------------------
>
>                 Key: DATASKETCHES-8
>                 URL: https://issues.apache.org/jira/browse/DATASKETCHES-8
>             Project: Apache Datasketches
>          Issue Type: Bug
>            Reporter: Adam Tamas
>            Priority: Major


