adriangb opened a new issue, #19098:
URL: https://github.com/apache/datafusion/issues/19098
### Is your feature request related to a problem or challenge?
In #19094 we are going to fix incorrect `total_byte_size` calculations for
`Statistics` by making them `Inexact` / `Absent` when we can't actually
calculate the size of the data. While this is more correct, it would be nice to
be able to calculate scan sizes and similar estimates in more scenarios. In
particular, we cannot calculate the scan size of variable-length columns (e.g.
`Utf8`) from just the type and number of rows.
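
To illustrate the limitation, here is a minimal sketch of a type-and-row-count estimate. `ColType`, `fixed_width` and `estimate_bytes` are hypothetical names made up for this example (a simplified stand-in for Arrow's `DataType`, not the real API), and the estimate ignores validity bitmaps and offset buffers.

```rust
// Hypothetical, simplified stand-in for Arrow's `DataType`; not the arrow-rs enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int32,
    Int64,
    Float64,
    Utf8,
}

/// Width in bytes of a single value, or `None` for variable-length types.
fn fixed_width(t: ColType) -> Option<usize> {
    match t {
        ColType::Int32 => Some(4),
        ColType::Int64 | ColType::Float64 => Some(8),
        // The byte length of a string column depends on the data itself.
        ColType::Utf8 => None,
    }
}

/// Estimated bytes for the values of one column, derived only from the
/// type and the row count. Returns `None` when that is not enough.
fn estimate_bytes(t: ColType, num_rows: usize) -> Option<usize> {
    fixed_width(t).map(|w| w * num_rows)
}
```

With this sketch, `estimate_bytes(ColType::Int64, 1000)` yields a size, while `estimate_bytes(ColType::Utf8, 1000)` has nothing to return, which is the gap a per-column scan size would fill.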
To address this I propose we add `ColumnStatistics { scan_byte_size:
Precision<usize>, ... }`, which can be populated by the file format, e.g. when
we know that for `Utf8View` the in-memory Arrow size is the same as the
uncompressed size of the Parquet column. I don't know in how many cases we'll
be able to derive this information without reading the data, but I think in
some cases we should be able to.
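
A rough sketch of what this could look like, under some assumptions. `Precision` below is a local copy of the three-variant shape used in DataFusion, `ColumnStatistics` is reduced to two fields, and `ColumnChunkMeta`, `size_matches_arrow` and `scan_byte_size_from_chunks` are hypothetical names invented for this example, not existing APIs.

```rust
// Local, simplified copy of DataFusion's `Precision` shape.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

// Reduced to two fields for the example; `scan_byte_size` is the proposed one.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStatistics {
    null_count: Precision<usize>,
    scan_byte_size: Precision<usize>,
}

/// Hypothetical per-column-chunk metadata a file format might expose.
struct ColumnChunkMeta {
    /// Uncompressed size recorded in the file metadata, if any.
    uncompressed_size: Option<usize>,
    /// Assumption: the format knows whether the decoded Arrow size
    /// equals the uncompressed size for this column's type.
    size_matches_arrow: bool,
}

/// Sum the chunk sizes without reading any data pages.
/// Any chunk with an unknown size makes the result `Absent`; any chunk
/// whose size is only an approximation makes the result `Inexact`.
fn scan_byte_size_from_chunks(chunks: &[ColumnChunkMeta]) -> Precision<usize> {
    let mut total = 0usize;
    let mut exact = true;
    for chunk in chunks {
        match chunk.uncompressed_size {
            Some(n) => {
                total += n;
                exact &= chunk.size_matches_arrow;
            }
            None => return Precision::Absent,
        }
    }
    if exact {
        Precision::Exact(total)
    } else {
        Precision::Inexact(total)
    }
}
```

The result would be stored in `ColumnStatistics::scan_byte_size` by the file format when it builds statistics from metadata.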
Once we have this, we can track the total scan size through projections,
limits, etc.
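
One possible way the propagation could work, as a sketch only. `Size`, `project` and `limit` are hypothetical names for this example; the rule that a limit scales the size proportionally and downgrades it to inexact is my assumption, not something decided in this issue.

```rust
// Simplified precision-tagged byte size, mirroring the Exact/Inexact/Absent idea.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Size {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Size {
    /// Add two sizes, keeping the weaker of the two precisions.
    fn add(self, other: Size) -> Size {
        match (self, other) {
            (Size::Exact(a), Size::Exact(b)) => Size::Exact(a + b),
            (Size::Exact(a) | Size::Inexact(a), Size::Exact(b) | Size::Inexact(b)) => {
                Size::Inexact(a + b)
            }
            _ => Size::Absent,
        }
    }
}

/// Projection: the total is the sum over the columns that are kept.
/// `keep` holds indices into `columns` and must be in bounds.
fn project(columns: &[Size], keep: &[usize]) -> Size {
    keep.iter()
        .fold(Size::Exact(0), |acc, &i| acc.add(columns[i]))
}

/// Limit: assume rows are roughly equal in size and scale by `fetch / num_rows`.
/// The scaled value is only an estimate, so it is always `Inexact`.
fn limit(size: Size, num_rows: usize, fetch: usize) -> Size {
    if fetch >= num_rows {
        return size; // the limit does not remove any rows
    }
    match size {
        Size::Exact(n) | Size::Inexact(n) => {
            // num_rows > fetch >= 0 here, so the division is safe.
            let scaled = (n as u128 * fetch as u128) / num_rows as u128;
            Size::Inexact(scaled as usize)
        }
        Size::Absent => Size::Absent,
    }
}
```

Dropping a column with an unknown size in a projection makes the total known again, which is one reason to track the size per column rather than only per table.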
### Describe the solution you'd like
_No response_
### Describe alternatives you've considered
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]