Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

via GitHub Tue, 08 Jul 2025 04:37:07 -0700


alamb commented on issue #16707:
URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3048541094


   I tried changing the code to use a ChunkedArray rather than a single array:
   ```diff
   -names_array = pa.concat_arrays([pa.array(names)] * batches)
   +names_array = pa.chunked_array([pa.array(names)] * batches)
   ```
   
   So the table construction now looks like:
   
   
   ```python
   names_array = pa.chunked_array([pa.array(names)] * batches)
   values_array = pa.chunked_array(
       [pa.array(np.random.randint(1, 100, len(names))) for _ in range(batches)]
   )
   
   pa_table = pa.Table.from_arrays(
       [names_array, values_array], names=["name", "value"]
   )
   ```
   
   And then I actually see the reverse performance (duckdb is slower 🤯):
   
   ```shell
   (venv) andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ python repro.py
   1.3.2
   47.0.0
   duckdb      : 981.68ms
   datafusion  : 292.68ms
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


