Hi, DataSketches provides an out-of-the-box HLL UDAF for Hive, but when I tried to use it in Spark I got the error below. Can someone explain why it fails in Spark?

spark-shell --jars datasketches-memory-1.2.0-incubating.jar,datasketches-hive-1.0.0-incubating.jar,datasketches-java-1.2.0-incubating.jar

spark.sql("""create temporary function data2sketch as 'org.apache.datasketches.hive.hll.DataToSketchUDAF'""")
spark.sql("""with v as (select 'a' x union select 'b') select data2sketch(x) from v""").show

Caused by: java.lang.ClassCastException: org.apache.datasketches.hive.hll.SketchState cannot be cast to org.apache.datasketches.hive.hll.UnionState
    at org.apache.datasketches.hive.hll.SketchEvaluator.merge(SketchEvaluator.java:69)
    at org.apache.datasketches.hive.hll.DataToSketchUDAF$DataToSketchEvaluator.merge(DataToSketchUDAF.java:114)
    at org.apache.spark.sql.hive.HiveUDAFFunction.merge(hiveUDFs.scala:421)
    at org.apache.spark.sql.hive.HiveUDAFFunction.merge(hiveUDFs.scala:307)
    at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.merge(interfaces.scala:541)
    at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$2.apply(AggregationIterator.scala:174)
    at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$2.apply(AggregationIterator.scala:174)
    at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:188)
    at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:182)
    at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:152)
    at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
    at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:114)
    at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105)
    ...

Best