alamb commented on issue #8078: URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3392268043
> Yep could be that! I was thinking maybe the last row group would be beneficial because (assuming the data is basically Parquet data)

This would work well if the data isn't sorted before writing (so the footer is a reasonable proxy for a random sample). If you sort the data beforehand, the last row group probably isn't a good random sample.

> Also sadly our Parquet reader cannot be pointed at a byte range of a file (I think that'd be easy to fix in a PR)

With the metadata you can always figure out the byte ranges of each column chunk.

However, I don't think you can just get the last 10% of the rows in the last row group. Data is stored column by column, so the data for the last 10% of the rows is going to be spread across multiple distinct ranges (one for each column).
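To make the byte-range point concrete, here is a rough sketch in plain Python. It is not DataFusion's or the parquet crate's API: `ColumnChunk`, `RowGroup`, and `column_chunk_ranges` are hypothetical stand-ins for the per-chunk offset and size recorded in the Parquet footer, and all the numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ColumnChunk:
    """Hypothetical stand-in for one column chunk's footer entry."""
    name: str
    offset: int           # byte offset where the chunk starts in the file
    compressed_size: int  # number of bytes the chunk occupies


@dataclass
class RowGroup:
    """Hypothetical stand-in for one row group's metadata."""
    num_rows: int
    columns: List[ColumnChunk]


def column_chunk_ranges(row_group: RowGroup) -> List[Tuple[str, int, int]]:
    """Return (column name, start, end) per column chunk, with end exclusive."""
    return [
        (c.name, c.offset, c.offset + c.compressed_size)
        for c in row_group.columns
    ]


# Made-up metadata for the last row group of an imaginary file.
last_row_group = RowGroup(
    num_rows=1000,
    columns=[
        ColumnChunk("id", offset=4, compressed_size=4000),
        ColumnChunk("name", offset=4004, compressed_size=12000),
        ColumnChunk("ts", offset=16004, compressed_size=8000),
    ],
)

ranges = column_chunk_ranges(last_row_group)
for name, start, end in ranges:
    print(f"{name}: bytes [{start}, {end})")

# The values for the last 10% of the rows (rows 900..999 here) sit somewhere
# inside each of these three ranges, so reading them touches every column's
# range rather than one contiguous tail of the file.
```

In this toy layout the row group's columns occupy three separate ranges, which is why a single "last N bytes" read would not correspond to the last N rows.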
