tustvold commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708382916
> arrow-rs at least exposed the selectivity of filters after each file is read

It is possible to provide an implementation of ArrowPredicate that tracks this. IIRC there is even a test in the parquet crate that does just this.

> Adjusting between files should be pretty simple.

This sounds like a nice idea, and I agree it should be a relatively straightforward lift.

> I don’t think we have enough information to do anything useful from statistics

I'm no expert here, but this seems off to me. Parquet provides metadata about sort orders, min/max values, null counts, etc. Unfortunately distinct_count is rarely populated, but I think you should be able to do something by checking whether a column spilled its dictionary: if it did not, the number of values in the dictionary page will tell you the distinct value count. You may be able to use the column index for this, not sure. But to say there is nothing seems overly pessimistic...
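
To make the selectivity-tracking idea concrete, here is a minimal, dependency-free sketch. Everything in it is an assumption for illustration: `Predicate`, `GreaterThan`, and `TrackingPredicate` are hypothetical names, and the trait is a simplified stand-in rather than the parquet crate's actual `ArrowPredicate` (which evaluates a `RecordBatch` and returns a `BooleanArray`). The point is only the wrapping pattern: delegate to the inner predicate and count rows seen versus rows selected.

```rust
/// Simplified stand-in for a row-filter predicate (hypothetical).
/// This is NOT the parquet crate's `ArrowPredicate` trait; plain slices
/// are used here so the sketch needs no external crates.
trait Predicate {
    /// Returns one boolean per input row: true if the row passes the filter.
    fn evaluate(&mut self, batch: &[i64]) -> Vec<bool>;
}

/// Hypothetical example predicate: keeps values strictly greater than a threshold.
struct GreaterThan(i64);

impl Predicate for GreaterThan {
    fn evaluate(&mut self, batch: &[i64]) -> Vec<bool> {
        batch.iter().map(|v| *v > self.0).collect()
    }
}

/// Wraps any predicate and records how many rows it saw and how many it selected.
struct TrackingPredicate<P> {
    inner: P,
    rows_seen: u64,
    rows_selected: u64,
}

impl<P: Predicate> TrackingPredicate<P> {
    fn new(inner: P) -> Self {
        Self { inner, rows_seen: 0, rows_selected: 0 }
    }

    /// Fraction of rows that passed so far, or `None` before any row was evaluated.
    fn selectivity(&self) -> Option<f64> {
        if self.rows_seen == 0 {
            None
        } else {
            Some(self.rows_selected as f64 / self.rows_seen as f64)
        }
    }
}

impl<P: Predicate> Predicate for TrackingPredicate<P> {
    fn evaluate(&mut self, batch: &[i64]) -> Vec<bool> {
        // Delegate to the wrapped predicate, then update the counters from its mask.
        let mask = self.inner.evaluate(batch);
        self.rows_seen += mask.len() as u64;
        self.rows_selected += mask.iter().filter(|&&keep| keep).count() as u64;
        mask
    }
}
```

Wrapping `GreaterThan(10)` and evaluating the batches `[1, 20, 30, 5]` and `[11, 2, 3, 4]` selects 3 of 8 rows. A reader could inspect that ratio after each file and use it to reorder or drop filters before the next one.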
