[I] Track individual column sizes in `Statistics` [datafusion]

via GitHub Thu, 04 Dec 2025 16:13:48 -0800


adriangb opened a new issue, #19098:
URL: https://github.com/apache/datafusion/issues/19098


   ### Is your feature request related to a problem or challenge?
   
   In #19094 we are going to fix incorrect `total_byte_size` calculations for 
`Statistics` by making them `Inexact` / `Absent` when we can't actually 
calculate the size of the data. While this is more correct, it would be nice if 
we could calculate scan sizes in more scenarios. In particular, we cannot 
calculate the scan size of a variable-length column (e.g. `Utf8`) from just its 
type and the number of rows.
   
   To address this, I propose adding a field such as 
`ColumnStatistics { scan_byte_size: Precision<usize>, ... }` that the file 
format can populate. For example, for `Utf8View` we know that the in-memory 
Arrow size is the same as the uncompressed size of the Parquet column. I don't 
know in how many cases we'll be able to derive this information without reading 
the data, but I think it should be possible in some.
   
   Once we have this, we can track the total scan size through projections, 
limits, etc.
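
One way the propagation could work, as a rough sketch rather than a proposed implementation. `Precision` is again a simplified stand-in, and `projected_scan_size` and `limited_scan_size` are hypothetical helpers. The assumptions are that a projection sums the sizes of the columns it keeps (dropping to `Inexact` or `Absent` as soon as any input is), and that a limit scales the size by the fraction of rows kept, which is only an estimate and is therefore marked `Inexact`.

```rust
/// Simplified stand-in for DataFusion's `Precision<T>` (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Add two sizes, keeping the weaker of the two precision levels.
fn add(a: Precision<usize>, b: Precision<usize>) -> Precision<usize> {
    use Precision::*;
    match (a, b) {
        (Exact(x), Exact(y)) => Exact(x.saturating_add(y)),
        (Exact(x) | Inexact(x), Exact(y) | Inexact(y)) => Inexact(x.saturating_add(y)),
        _ => Absent,
    }
}

/// Total scan size after a projection: the sum over the retained columns.
/// `projection` holds indices into `column_sizes`; an out-of-range index
/// makes the result `Absent`.
pub fn projected_scan_size(
    column_sizes: &[Precision<usize>],
    projection: &[usize],
) -> Precision<usize> {
    projection
        .iter()
        .fold(Precision::Exact(0), |acc, &i| match column_sizes.get(i) {
            Some(&size) => add(acc, size),
            None => Precision::Absent,
        })
}

/// Scan size after a limit of `fetch` rows on an input of `input_rows` rows.
/// If the limit keeps every row the size is unchanged; otherwise it is scaled
/// proportionally and downgraded to `Inexact`, since rows may differ in size.
pub fn limited_scan_size(
    size: Precision<usize>,
    input_rows: usize,
    fetch: usize,
) -> Precision<usize> {
    if fetch >= input_rows {
        return size;
    }
    match size {
        Precision::Exact(bytes) | Precision::Inexact(bytes) => {
            // Widen to u128 so the intermediate product cannot overflow.
            let scaled = (bytes as u128) * (fetch as u128) / (input_rows as u128);
            Precision::Inexact(scaled as usize)
        }
        Precision::Absent => Precision::Absent,
    }
}
```

Keeping the sizes per column is what makes the projection case work: with only a table-level `total_byte_size`, dropping a column would force a guess about how much of the total it accounted for.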
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

