GitHub user adriangb closed a discussion: Preserving row group information during reads
Hi folks,

I'm essentially trying to implement an external bloom filter for columns with randomly distributed values (where min/max stats don't help) and where reading bloom filter information from all files may be too expensive (100s of high-latency object store requests), which is why I want to store this information externally.

The roadblock I'm running into is getting access to values _per row group_, the way bloom filters do.

Given that I have a DataFrame that I'm going to write out, as I see it my options are:

- Write it out, then use the metadata to read it back one row group at a time so I can get the values per row group. I should be able to do this as is, but it's gross and won't perform well even if I have the data in memory.
- Make changes to the writer APIs to allow hooking into the spot where bloom filters are updated, but instead call a callback, accumulate the values, etc. This would be the most efficient but requires API design and changes to DataFusion.
- Add a feature to inject the row group a row came from when reading a file (I'd still need to read it back, and it requires changes to DataFusion, but at least it would be simple on my end).

Any suggestions on APIs or hook points I may be overlooking?

Thanks!

GitHub link: https://github.com/apache/datafusion/discussions/12498
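
For illustration only, here is a minimal sketch of the kind of external per-row-group index described above. It assumes the column values for each row group have already been collected by one of the options listed, and it uses only the Python standard library. The names `TinyBloom`, `build_index`, and `candidate_row_groups` are hypothetical and not DataFusion or Parquet APIs. The filter size and hash count are arbitrary placeholders rather than tuned values.

```python
import hashlib
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple

# Hypothetical key for one row group: (file name, row group index).
RowGroupKey = Tuple[str, int]


class TinyBloom:
    """Small Bloom filter; num_bits and num_hashes are illustrative only."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value: Hashable) -> Iterator[int]:
        # Derive num_hashes bit positions by salting one hash function.
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, value: Hashable) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: Hashable) -> bool:
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_index(
    row_groups: Dict[RowGroupKey, Iterable[Hashable]]
) -> Dict[RowGroupKey, TinyBloom]:
    """Build one filter per row group from its column values."""
    index: Dict[RowGroupKey, TinyBloom] = {}
    for key, values in row_groups.items():
        bloom = TinyBloom()
        for value in values:
            bloom.add(value)
        index[key] = bloom
    return index


def candidate_row_groups(
    index: Dict[RowGroupKey, TinyBloom], value: Hashable
) -> List[RowGroupKey]:
    """Return the row groups that may contain value (no false negatives)."""
    return [key for key, bloom in index.items() if bloom.might_contain(value)]


# Made-up example data: per-row-group values for a single column.
row_groups = {
    ("a.parquet", 0): ["u1", "u7"],
    ("a.parquet", 1): ["u3"],
    ("b.parquet", 0): ["u7", "u9"],
}
index = build_index(row_groups)
hits = candidate_row_groups(index, "u7")
```

Row groups that actually contain the value always appear in `hits`; others may occasionally appear as false positives. Such an index could be stored outside the Parquet files so that a lookup does not need one object store request per file.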
