tustvold commented on issue #3463:
URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708382916

   > arrow-rs at least exposed the selectivity of filters after each file is 
read
   
   It is possible to provide an implementation of `ArrowPredicate` that tracks 
this selectivity. IIRC there is even a test in the parquet crate that does just this.
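
To make the idea concrete, here is a minimal, dependency-free sketch of a counting wrapper. It does not use the parquet crate: `Batch`, `Predicate`, `TrackingPredicate` and `GreaterThan` are hypothetical stand-ins. The real `ArrowPredicate::evaluate` takes a `RecordBatch` and returns `Result<BooleanArray, ArrowError>`, and the trait also has a `projection` method, so a real wrapper would delegate both and could use `BooleanArray::true_count()` in place of the manual count.

```rust
/// Stand-in for a decoded batch with a single i64 column.
/// (Assumption: the real reader passes an Arrow `RecordBatch`.)
pub struct Batch {
    pub values: Vec<i64>,
}

/// Simplified stand-in for the parquet crate's `ArrowPredicate` trait.
pub trait Predicate {
    /// Returns one boolean per row: `true` means the row is kept.
    fn evaluate(&mut self, batch: &Batch) -> Vec<bool>;
}

/// Wraps any predicate and counts rows evaluated and rows kept.
pub struct TrackingPredicate<P> {
    inner: P,
    rows_seen: u64,
    rows_kept: u64,
}

impl<P: Predicate> TrackingPredicate<P> {
    pub fn new(inner: P) -> Self {
        Self { inner, rows_seen: 0, rows_kept: 0 }
    }

    /// Fraction of rows that passed so far, or `None` before any row is seen.
    pub fn selectivity(&self) -> Option<f64> {
        if self.rows_seen == 0 {
            None
        } else {
            Some(self.rows_kept as f64 / self.rows_seen as f64)
        }
    }
}

impl<P: Predicate> Predicate for TrackingPredicate<P> {
    fn evaluate(&mut self, batch: &Batch) -> Vec<bool> {
        // Delegate to the wrapped predicate, then record how many rows passed.
        let mask = self.inner.evaluate(batch);
        self.rows_seen += mask.len() as u64;
        self.rows_kept += mask.iter().filter(|&&keep| keep).count() as u64;
        mask
    }
}

/// Example inner predicate: keep values strictly greater than a threshold.
pub struct GreaterThan(pub i64);

impl Predicate for GreaterThan {
    fn evaluate(&mut self, batch: &Batch) -> Vec<bool> {
        batch.values.iter().map(|v| *v > self.0).collect()
    }
}
```

After a file has been read, a caller could inspect `selectivity()` on each wrapped predicate and use it when deciding how to treat that filter for the next file.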
   
   > Adjusting between files should be pretty simple.
   
   This sounds like a nice idea, and I agree it should be a relatively 
straightforward lift.
   
   > I don’t think we have enough information to do anything useful from 
statistics
   
   I'm no expert here, but this seems off to me. Parquet provides metadata 
about sort orders, min/max values, null counts, etc. Unfortunately 
`distinct_count` is rarely populated, but I think you should be able to do 
something by checking whether a column spilled its dictionary: if it did not, 
the number of values in the dictionary page will tell you the distinct value 
count. You may be able to use the column index for this, not sure. But to say 
there is nothing seems overly pessimistic.
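
The dictionary check could look roughly like the sketch below. It is an illustration under stated assumptions, not code against the parquet crate: `Encoding`, `ChunkSummary` and `exact_distinct_count` are hypothetical names, and the summary is assumed to have been filled in beforehand from the dictionary page header's value count and the encodings of the chunk's data pages. A dictionary holds only non-null values, so the result counts distinct non-null values.

```rust
/// Subset of Parquet page encodings relevant to this check (illustrative).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Encoding {
    Plain,
    PlainDictionary,
    RleDictionary,
    DeltaBinaryPacked,
}

/// Hypothetical per-column-chunk summary, assumed to be gathered elsewhere.
pub struct ChunkSummary {
    /// Number of entries in the dictionary page, if the chunk has one.
    pub dictionary_entries: Option<u64>,
    /// Encoding of each data page in the chunk, in order.
    pub data_page_encodings: Vec<Encoding>,
}

/// Returns the distinct non-null value count for a chunk when it can be read
/// off the dictionary, i.e. when every data page is dictionary encoded.
/// Returns `None` if there is no dictionary or the writer fell back to
/// another encoding part-way through (the dictionary "spilled").
pub fn exact_distinct_count(chunk: &ChunkSummary) -> Option<u64> {
    let entries = chunk.dictionary_entries?;
    let all_dictionary = !chunk.data_page_encodings.is_empty()
        && chunk
            .data_page_encodings
            .iter()
            .all(|e| matches!(e, Encoding::PlainDictionary | Encoding::RleDictionary));
    if all_dictionary {
        Some(entries)
    } else {
        None
    }
}
```

This is a per-chunk figure. Across several row groups the column's distinct count is only bounded: at least the largest per-chunk count and at most their sum.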


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

