Re: [I] Improve performance of `dropDuplicates` [datafusion-comet]

via GitHub Tue, 08 Apr 2025 15:09:27 -0700


andygrove commented on issue #1275:
URL: 
https://github.com/apache/datafusion-comet/issues/1275#issuecomment-2787767219


   The performance issue seems specific to the `first_value` aggregate 
function. The `dropDuplicates` call is equivalent to:
   
   ```sql
   select ss_item_sk, ss_quantity,
          first_value(ss_sold_date_sk), first_value(ss_sold_time_sk),
          first_value(ss_customer_sk), first_value(ss_cdemo_sk),
          first_value(ss_hdemo_sk), first_value(ss_addr_sk),
          first_value(ss_store_sk), first_value(ss_promo_sk),
          first_value(ss_ticket_number), first_value(ss_wholesale_cost),
          first_value(ss_list_price), first_value(ss_sales_price),
          first_value(ss_ext_discount_amt), first_value(ss_ext_sales_price),
          first_value(ss_ext_wholesale_cost), first_value(ss_ext_list_price),
          first_value(ss_ext_tax), first_value(ss_coupon_amt),
          first_value(ss_net_paid), first_value(ss_net_paid_inc_tax),
          first_value(ss_net_profit)
   from store_sales
   group by ss_item_sk, ss_quantity
   ```
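
   To illustrate the equivalence, here is a minimal sketch using plain Scala collections rather than Spark. The `Sale` type, its fields, and the sample rows are hypothetical and only stand in for `store_sales`. It assumes a fixed input order, which Spark does not guarantee after a shuffle.

   ```scala
   // Hypothetical row type for illustration; not the real store_sales schema.
   final case class Sale(itemSk: Int, quantity: Int, storeSk: Int, netPaid: Double)

   object DropDuplicatesSketch {
     // dropDuplicates on (itemSk, quantity): keep the first row seen for each key.
     def dropDuplicates(rows: Seq[Sale]): Seq[Sale] = {
       val seen = scala.collection.mutable.LinkedHashMap.empty[(Int, Int), Sale]
       rows.foreach { r => seen.getOrElseUpdate((r.itemSk, r.quantity), r) }
       seen.values.toSeq
     }

     // Group by the key and take the first value of every other column,
     // mirroring "group by ... first_value(col)" in the SQL above.
     def groupByFirstValue(rows: Seq[Sale]): Seq[Sale] =
       rows
         .groupBy(r => (r.itemSk, r.quantity))
         .toSeq
         .map { case ((item, qty), group) =>
           Sale(item, qty, group.head.storeSk, group.head.netPaid)
         }

     val sample: Seq[Sale] = Seq(
       Sale(1, 10, 7, 5.0),
       Sale(1, 10, 3, 9.0), // duplicate key (1, 10); dropped by both functions
       Sale(2, 20, 4, 1.0)
     )
   }
   ```

   Both functions return the same set of rows for `sample`. The order of groups from `groupBy` is unspecified, so compare the results as sets rather than as sequences.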
   
   Using `min` instead of `first_value` is about 5x faster in my test (roughly 64 s versus 321 s).
   
   Here is how I tested:
   
   ```scala
   // Run in spark-shell, where `spark` and the `$"..."` column syntax are in scope.
   def bench(aggr: String): Unit = {
     val df = spark.read.parquet("/mnt/bigdata/tpcds/sf100/store_sales.parquet")
     // Aggregate every column except the two grouping keys.
     val keys = Seq("ss_item_sk", "ss_quantity")
     val cols = df.schema.fieldNames.filterNot(name => keys.contains(name))
     df.repartition($"ss_item_sk").createOrReplaceTempView("store_sales")
     val aggs = cols.map(name => s"$aggr($name)").mkString(", ")
     val sql = s"select ss_item_sk, ss_quantity, $aggs from store_sales group by ss_item_sk, ss_quantity"
     spark.sql(sql).write.parquet("output2.parquet")
   }
   
   scala> spark.time(bench("first_value"))
   Time taken: 320671 ms 
   
   scala> spark.time(bench("min"))
   Time taken: 64462 ms
   ```
   
   I don't understand why there would be such a significant performance 
difference between `first_value` and `min`. I will try to reproduce this in 
DataFusion next.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

