valkum opened a new issue, #17445: URL: https://github.com/apache/datafusion/issues/17445
### Describe the bug

I was trying to build a repro case for a query that is slow compared to polars and ran into this bug. The error originates in https://github.com/apache/arrow-rs, so I assume this is more of a tracking issue, but it might also be caused by how datafusion calls parquet. I haven't looked for the root cause yet.

With a sample dataset of 10m rows, I run into the following error:

`pyo3_runtime.PanicException: MutableArrayData::new is infallible: DictionaryKeyOverflowError`

The query takes an unnested schema, nests it, and aggregates the newly nested struct into a list. The error does not occur with 1m rows. The input parquet file only contains a Dict with 10 keys, so I am not sure why this blows up. When nesting the dict key, the same dict can be reused.

### To Reproduce

For a repro case, see https://github.com/valkum/polars-datafusion-comparison/tree/datafusion_bug

### Expected behavior

DataFrame.write_parquet should succeed.

### Additional context

_No response_
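
To make the shape of the query concrete, here is a minimal pure-Python sketch of "nest the flat columns into a struct, then aggregate the structs into a list per group". It is only a model of the transformation, not the DataFusion or arrow-rs API and not the actual repro. The names `dictionary`, `flat_rows`, and `nest_and_aggregate`, as well as the column layout, are assumptions made up for illustration. It also shows the expectation stated above: the nested structs can keep pointing at the same 10-entry dictionary, so the keys themselves never grow.

```python
from collections import defaultdict

# Hypothetical dictionary-encoded column: 10 distinct values,
# rows store small integer keys that index into this list.
dictionary = [f"cat_{i}" for i in range(10)]

# Hypothetical flat (unnested) rows: (group_id, dict_key, amount).
flat_rows = [(i % 3, i % len(dictionary), float(i)) for i in range(12)]


def nest_and_aggregate(rows):
    """Nest (dict_key, amount) into a struct and collect the structs per group."""
    grouped = defaultdict(list)
    for group_id, dict_key, amount in rows:
        # The struct stores the dictionary key as-is; the dictionary is shared.
        grouped[group_id].append({"category": dict_key, "amount": amount})
    return dict(grouped)


nested = nest_and_aggregate(flat_rows)

# Largest key used anywhere in the nested output; it still indexes `dictionary`.
max_key = max(item["category"] for items in nested.values() for item in items)
```

In this model the number of rows only affects how long the lists get, not the size of the dictionary, which is why an overflow of the dictionary key type is surprising for an input with 10 keys.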