alamb commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3048541094
I tried changing the code to use a `ChunkedArray` rather than a single array:

```diff
-names_array = pa.concat_arrays([pa.array(names)] * batches)
+names_array = pa.chunked_array([pa.array(names)] * batches)
```

So the table now looks like:

```python
names_array = pa.chunked_array([pa.array(names)] * batches)
values_array = pa.chunked_array([pa.array(np.random.randint(1, 100, len(names))) for _ in range(batches)])
pa_table = pa.Table.from_arrays([names_array, values_array], names=["name", "value"])
```

And then I actually see the reverse performance (duckdb is slower 🤯):

```shell
(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ python repro.py
1.3.2
47.0.0
duckdb : 981.68ms
datafusion : 292.68ms
```

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org