alamb commented on issue #17446: URL: https://github.com/apache/datafusion/issues/17446#issuecomment-3259715207
Thank you @valkum. I tried the reproducer locally and I also see the reported difference.

I rewrote the query into the equivalent SQL, as that was easier for me to profile.

Here is the input: [repo.zip](https://github.com/user-attachments/files/22181029/repo.zip)

```sql
COPY (
  WITH df as (
    SELECT
      name,
      group,
      CASE
        WHEN market IS NOT NULL AND price IS NOT NULL
          THEN named_struct('market', market, 'price', price)
        ELSE NULL
      END as market
    FROM 'sample-1m.parquet'
  ),
  df2 as (
    SELECT
      name,
      group,
      array_agg(market) as markets
    FROM df
    GROUP BY name, group
  )
  SELECT
    name,
    group,
    CASE
      WHEN markets[0] IS NOT NULL THEN markets
      ELSE NULL
    END
  FROM df2
) TO 'output-datafusion.parquet';
```

I then profiled it with samply:

```shell
samply record -- ~/Software/datafusion-cli/datafusion-cli-49.0.0 -f report.sql
```

Half the time is spent in the `array_agg` implementation:

<img width="1725" height="441" alt="Image" src="https://github.com/user-attachments/assets/f1fd769b-816b-4c93-a1aa-adcecfd3528a" />

So I think a lot of the difference would be fixed with a better `array_agg` implementation. Here is one idea of how to do it:

- https://github.com/apache/datafusion/issues/10145

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org