valkum opened a new issue, #17445:
URL: https://github.com/apache/datafusion/issues/17445

   ### Describe the bug
   
   While building a repro case for a query that is slow compared to polars, I 
encountered this bug. Since the error originates in 
https://github.com/apache/arrow-rs, I assume this is more of a tracking issue, 
but it might also be caused by how DataFusion calls parquet. I haven't looked 
for the root cause yet.
   
   With a sample dataset of 10M rows, I run into the following panic:
   `pyo3_runtime.PanicException: MutableArrayData::new is infallible: DictionaryKeyOverflowError`
   
   The query takes a flat (unnested) schema, nests its columns into a struct, 
and aggregates the newly nested structs into a list.
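
To make the shape of that transformation concrete, here is a minimal pure-Python sketch. It does not use DataFusion, and the column names (`group_id`, `category`, `value`) are hypothetical, chosen only for illustration; the actual schema is in the linked repro.

```python
from collections import defaultdict

# Hypothetical flat rows: (group_id, category, value).
# These column names are assumptions, not the schema of the real dataset.
rows = [
    (1, "a", 10),
    (1, "b", 20),
    (2, "a", 30),
]


def nest_and_aggregate(flat_rows):
    """Nest (category, value) into a struct-like dict, then collect
    the structs into one list per group_id."""
    groups = defaultdict(list)
    for group_id, category, value in flat_rows:
        groups[group_id].append({"category": category, "value": value})
    return dict(groups)


result = nest_and_aggregate(rows)
```

Each output entry corresponds to one group, holding a list of the nested structs built from that group's rows.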
   
   The problem does not occur with 1M rows. The input parquet file contains 
only a single dictionary with 10 keys, so I am not sure why the key space 
overflows. When the dictionary-encoded column is nested, the same dictionary 
could be reused.
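
One way such an overflow could arise, even with only 10 distinct values, is if dictionaries from many batches are concatenated without deduplication, so that each batch shifts its keys past the previous batch's values. This is a guess at the mechanism, not a confirmed root cause. The sketch below models the idea in plain Python; the 8-bit key width (`KEY_MAX = 127`) and the helper names are assumptions made for illustration and say nothing about the key type in the actual file.

```python
KEY_MAX = 127  # assumed: largest key of a signed 8-bit dictionary index


class DictionaryKeyOverflowError(Exception):
    """Raised when a key no longer fits in the assumed key type."""


def concat_without_dedup(batches, key_max=KEY_MAX):
    """Append every batch's dictionary values and shift its keys by the
    number of values already present (no deduplication)."""
    values, keys = [], []
    for batch_values, batch_keys in batches:
        offset = len(values)
        values.extend(batch_values)
        for key in batch_keys:
            shifted = key + offset
            if shifted > key_max:
                raise DictionaryKeyOverflowError(shifted)
            keys.append(shifted)
    return values, keys


def concat_with_dedup(batches, key_max=KEY_MAX):
    """Reuse one dictionary: a value gets a new key only the first time it is seen."""
    values, index, keys = [], {}, []
    for batch_values, batch_keys in batches:
        for key in batch_keys:
            value = batch_values[key]
            if value not in index:
                if len(values) > key_max:
                    raise DictionaryKeyOverflowError(len(values))
                index[value] = len(values)
                values.append(value)
            keys.append(index[value])
    return values, keys


# Every batch carries the same 10-value dictionary and uses each key once.
ten_values = [f"v{i}" for i in range(10)]
batch = (ten_values, list(range(10)))
```

In this model, 12 batches still fit (the highest shifted key is 119), while a 13th batch pushes keys past 127 and raises. The deduplicating variant keeps 10 dictionary values no matter how many batches are merged, which would match the observation that small inputs succeed and large ones fail.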
   
   ### To Reproduce
   
   A repro case is available at 
https://github.com/valkum/polars-datafusion-comparison/tree/datafusion_bug
   
   ### Expected behavior
   
   DataFrame.write_parquet should succeed.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

