Re: [I] Slow aggregrate query, Polars is 4 times faster for equal query [datafusion]

via GitHub Fri, 05 Sep 2025 21:00:49 -0700


alamb commented on issue #17446:
URL: https://github.com/apache/datafusion/issues/17446#issuecomment-3259715207


   Thank you @valkum 
   
   I tried the reproducer locally and I also see the reported difference. I rewrote the query as equivalent SQL, since that was easier for me to profile:
   
   Here is the input: 
   
   [repo.zip](https://github.com/user-attachments/files/22181029/repo.zip)
   
   ```sql
   COPY (
     WITH df as (
       SELECT
         name, group,
         CASE WHEN market IS NOT NULL AND price IS NOT NULL THEN named_struct('market', market, 'price', price) ELSE NULL END as market
       FROM 'sample-1m.parquet'
     ),
     df2 as (
       SELECT
         name,
         group,
         array_agg(market) as markets
       FROM df
       GROUP BY
         name,
         group
     )
     SELECT
       name,
       group,
       CASE WHEN markets[0] IS NOT NULL THEN markets ELSE NULL END
     FROM df2
   ) TO 'output-datafusion.parquet';
   ```
   
   I then profiled it with samply:
   
   ```shell
   samply record -- ~/Software/datafusion-cli/datafusion-cli-49.0.0 -f report.sql
   ```
   
   Half the time is spent in the `array_agg` implementation:
   
   <img width="1725" height="441" alt="Image" src="https://github.com/user-attachments/assets/f1fd769b-816b-4c93-a1aa-adcecfd3528a" />
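   
   As a rough cross-check that does not need a profiler, the aggregation step can be run on its own under `EXPLAIN ANALYZE` in datafusion-cli. This is only a sketch: it reuses the `df` step from the query above unchanged, assumes the same `sample-1m.parquet` file is in the working directory, and the per-operator metrics it prints differ between versions.
   
   ```sql
   -- Run only the GROUP BY + array_agg step and print per-operator metrics
   EXPLAIN ANALYZE
   WITH df as (
     SELECT
       name, group,
       CASE WHEN market IS NOT NULL AND price IS NOT NULL THEN named_struct('market', market, 'price', price) ELSE NULL END as market
     FROM 'sample-1m.parquet'
   )
   SELECT
     name,
     group,
     array_agg(market) as markets
   FROM df
   GROUP BY
     name,
     group;
   ```
   
   If the aggregate operators account for most of the elapsed time there as well, that would be consistent with the profile.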
   
   
   So I think a better `array_agg` implementation would close much of the gap.
   
   Here is one idea for how to do it:
   - https://github.com/apache/datafusion/issues/10145


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

