Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [datafusion]

via GitHub Sat, 18 Oct 2025 13:54:25 -0700


alamb commented on issue #8078:
URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3392268043


   > Yep could be that! I was thinking maybe the last row group would be 
beneficial because (assuming the data is basically Parquet data) 
   
   This would work well if the data isn't sorted before writing (so the last 
row group is a reasonable proxy for a random sample). If you sort the data 
beforehand, the last row group probably isn't a good random sample.
   
   > Also sadly our Parquet reader cannot be pointed at a byte range of a file 
(I think that'd be easy to fix in a PR) 
   
   With the metadata you can always figure out the byte ranges of each column chunk.
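
As a rough illustration of deriving those ranges from the footer metadata, here is a minimal sketch. `ColumnChunkMeta` is a hypothetical stand-in whose fields mirror the `dictionary_page_offset`, `data_page_offset`, and `total_compressed_size` fields of the Parquet column metadata; it is not the reader's actual type, and the helper names are made up for this example.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the per-column-chunk fields stored in the
/// Parquet footer (not the real reader type).
#[derive(Debug, Clone)]
pub struct ColumnChunkMeta {
    pub column: String,
    pub dictionary_page_offset: Option<u64>,
    pub data_page_offset: u64,
    pub total_compressed_size: u64,
}

/// Byte range of one column chunk within the file.
pub fn chunk_byte_range(meta: &ColumnChunkMeta) -> Range<u64> {
    // The chunk starts at the dictionary page when there is one,
    // otherwise at the first data page.
    let start = meta.dictionary_page_offset.unwrap_or(meta.data_page_offset);
    start..start + meta.total_compressed_size
}

/// Byte ranges for every column chunk of a single row group.
pub fn row_group_byte_ranges(columns: &[ColumnChunkMeta]) -> Vec<(String, Range<u64>)> {
    columns
        .iter()
        .map(|c| (c.column.clone(), chunk_byte_range(c)))
        .collect()
}
```

Each column of the row group yields its own range, so reading a row group generally means fetching one range per projected column.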
   
   However, I don't think you can just get the last 10% of the rows in the last 
row group with a single read, because data is stored column by column, so the 
data for the last 10% of the rows is going to be spread across multiple 
distinct byte ranges (one for each column).
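
To make the "one range per column" point concrete, here is a small sketch. It assumes the file carries a page-level offset index (which the comment above does not mention) and that the pages of a column chunk are laid out contiguously. `PageLocation` is a hypothetical stand-in whose fields mirror the offset index entries in the Parquet format, and the function names are made up for this example.

```rust
use std::ops::Range;

/// Hypothetical stand-in for one entry of a column chunk's offset index.
#[derive(Debug, Clone, Copy)]
pub struct PageLocation {
    pub offset: u64,
    pub compressed_page_size: u64,
    pub first_row_index: u64,
}

/// Byte range within ONE column chunk covering every page that holds rows
/// at or after `first_wanted_row`. Assumes `pages` is sorted by
/// `first_row_index` and that pages are contiguous in the file.
pub fn tail_range(pages: &[PageLocation], first_wanted_row: u64) -> Option<Range<u64>> {
    // Last page that starts at or before the first wanted row.
    let idx = pages
        .partition_point(|p| p.first_row_index <= first_wanted_row)
        .checked_sub(1)?;
    let first = pages[idx];
    let last = pages.last()?;
    Some(first.offset..last.offset + last.compressed_page_size)
}

/// One byte range per column for the last 10% of a row group's rows.
pub fn tail_ranges_per_column(columns: &[Vec<PageLocation>], num_rows: u64) -> Vec<Range<u64>> {
    let first_wanted_row = num_rows - num_rows / 10;
    columns
        .iter()
        .filter_map(|pages| tail_range(pages, first_wanted_row))
        .collect()
}
```

Even in this simplified model the result is a list of disjoint ranges, one per column, rather than a single trailing slice of the file.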



