andygrove commented on issue #1275: URL: https://github.com/apache/datafusion-comet/issues/1275#issuecomment-2787767219
The performance issue seems specific to the `first_value` aggregate function. The `dropDuplicates` call is equivalent to:

```sql
select
  ss_item_sk,
  ss_quantity,
  first_value(ss_sold_date_sk),
  first_value(ss_sold_time_sk),
  first_value(ss_customer_sk),
  first_value(ss_cdemo_sk),
  first_value(ss_hdemo_sk),
  first_value(ss_addr_sk),
  first_value(ss_store_sk),
  first_value(ss_promo_sk),
  first_value(ss_ticket_number),
  first_value(ss_wholesale_cost),
  first_value(ss_list_price),
  first_value(ss_sales_price),
  first_value(ss_ext_discount_amt),
  first_value(ss_ext_sales_price),
  first_value(ss_ext_wholesale_cost),
  first_value(ss_ext_list_price),
  first_value(ss_ext_tax),
  first_value(ss_coupon_amt),
  first_value(ss_net_paid),
  first_value(ss_net_paid_inc_tax),
  first_value(ss_net_profit)
from store_sales
group by ss_item_sk, ss_quantity
```

Using `min` instead of `first_value` is much faster. Here is how I tested:

```scala
def bench(aggr: String) {
  val df = spark.read.parquet("/mnt/bigdata/tpcds/sf100/store_sales.parquet")
  val cols = df.schema.fieldNames.filterNot(name => Seq("ss_item_sk", "ss_quantity").contains(name))
  df.repartition($"ss_item_sk").createOrReplaceTempView("store_sales")
  val sql = s"select ss_item_sk, ss_quantity, ${cols.map(name => s"$aggr($name)").mkString(", ")} from store_sales group by ss_item_sk, ss_quantity"
  spark.sql(sql).write.parquet("output2.parquet")
}

scala> spark.time(bench("first_value"))
Time taken: 320671 ms

scala> spark.time(bench("min"))
Time taken: 64462 ms
```

I don't understand why there would be such a significant performance difference between `first_value` and `min`. I will try to reproduce this in DataFusion next.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org