Supporting Datasketches HllSketch via Spark Functions

Ryan Berti Wed, 19 Apr 2023 10:33:39 -0700

Hello!

In early January, I sent an e-mail to this dev list proposing the addition
of new functionality to the approx_count_distinct implementation. After
discussions with @Daniel Tenedorio <daniel.tenedo...@databricks.com> and
team, we decided to shift focus to integrating Apache DataSketches with
Spark rather than extending the approx_count_distinct implementation. See
the e-mail titled 'Implementation for approx_count_distinct_sketch and
associated functions' and the discussion in this PR
<https://github.com/RyanBerti/spark/pull/1> for more context.


I've completed the proposed integration and would like to open up the new PR
<https://github.com/apache/spark/pull/40615> for wider review. The PR
provides four new functions:

Aggregate functions:

   - hll_sketch_agg(IntegerType|LongType|StringType|BinaryType) ->
   BinaryType
   - hll_union_agg(BinaryType) -> BinaryType

Non-aggregate functions:

   - hll_union(BinaryType, BinaryType) -> BinaryType
   - hll_sketch_estimate(BinaryType) -> LongType
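
To illustrate how these four functions are meant to compose, here is a minimal
pure-Python sketch of the semantics only. It is not the Datasketches HllSketch
implementation and not the code in the PR: the toy_* names, the register count
(P = 12), the SHA-1 hash and the byte layout are all assumptions made for
illustration. It uses a textbook HyperLogLog in which a sketch is a byte string
of registers and a union is an element-wise maximum.

```python
import hashlib
import math

P = 12        # index bits (assumed for this toy; not taken from the PR)
M = 1 << P    # number of registers


def _hash64(value):
    """Deterministic 64-bit hash of a value's string form."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def toy_sketch_agg(values):
    """Analogue of hll_sketch_agg: fold values into one binary sketch."""
    registers = bytearray(M)
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - P)                       # top P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
        if rank > registers[idx]:
            registers[idx] = rank
    return bytes(registers)


def toy_union(a, b):
    """Analogue of hll_union: merge two binary sketches into one."""
    return bytes(max(x, y) for x, y in zip(a, b))


def toy_union_agg(sketches):
    """Analogue of hll_union_agg: merge any number of binary sketches."""
    merged = bytes(M)  # empty sketch: all registers zero
    for s in sketches:
        merged = toy_union(merged, s)
    return merged


def toy_sketch_estimate(sketch):
    """Analogue of hll_sketch_estimate: binary sketch -> integer estimate."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction (linear counting).
        return round(M * math.log(M / zeros))
    return round(raw)


# Two overlapping inputs with 10000 distinct values between them.
a = toy_sketch_agg(range(0, 6000))
b = toy_sketch_agg(range(4000, 10000))
print(toy_sketch_estimate(toy_union(a, b)))  # an approximation of 10000
```

The point of the design is that sketches can be stored as binary values and
merged later without re-reading the raw data: in this toy, merging the two
sketches gives the same bytes as sketching the combined input directly.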

The latest test run failed, apparently due to connectivity issues. Is there
an easy way to re-run the tests without pushing a new commit?

Thanks!

Ryan Berti

Senior Data Engineer  |  Ads DE

M 7023217573

5808 W Sunset Blvd  |  Los Angeles, CA 90028
