Hello!

In early January, I sent an e-mail to this dev list proposing the addition of new functionality to the approx_count_distinct implementation. After discussions with @Daniel Tenedorio <daniel.tenedo...@databricks.com> and team, we decided to shift focus to integrating Apache DataSketches with Spark rather than extending the approx_count_distinct implementation. See the e-mail titled 'Implementation for approx_count_distinct_sketch and associated functions' and the discussion in this PR <https://github.com/RyanBerti/spark/pull/1> for more context.

I've completed the proposed integration and would like to open up the new PR <https://github.com/apache/spark/pull/40615> for wider review. The PR provides four new functions:

Aggregate functions:
- hll_sketch_agg(IntegerType|LongType|StringType|BinaryType) -> BinaryType
- hll_union_agg(BinaryType) -> BinaryType

Non-aggregate functions:
- hll_union(BinaryType, BinaryType) -> BinaryType
- hll_sketch_estimate(BinaryType) -> LongType

The latest set of tests failed due to what appear to be connectivity issues. Is there an easy way to re-run the tests without pushing a new commit?

Thanks!

Ryan Berti
Senior Data Engineer | Ads DE
M 7023217573
5808 W Sunset Blvd | Los Angeles, CA 90028