GitHub user adriangb closed a discussion: Preserving row group information 
during reads

Hi folks,

I'm essentially trying to implement an external bloom filter for columns with 
randomly distributed values (where min/max stats don't help). Reading the 
bloom filter information from every file may be too expensive (hundreds of 
high-latency object store requests), which is why I want to store this 
information externally.
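
To make the idea concrete, here is a minimal sketch of an external index that 
keeps one small bloom filter per (file, row group) pair. Everything in it is an 
assumption for illustration: the `BloomFilter` class, the `index` dict, the 
file names and the values are hypothetical, and the filter is a toy, not the 
split-block format that Parquet itself stores.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter for illustration; not Parquet's on-disk format."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # True may be a false positive; False is always definitive.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical per-row-group values; in practice these are exactly the
# values that are hard to get hold of.
row_groups = {
    ("a.parquet", 0): ["k1", "k7"],
    ("a.parquet", 1): ["k3"],
    ("b.parquet", 0): ["k9", "k2"],
}

# External index: (file path, row group number) -> filter.
index = {}
for key, values in row_groups.items():
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    index[key] = bf


def candidate_row_groups(value):
    """Row groups that may contain value, found without touching any file."""
    return [key for key, bf in index.items() if bf.might_contain(value)]
```

A lookup such as `candidate_row_groups("k3")` always includes 
`("a.parquet", 1)` and may include others as false positives, so only the 
returned row groups would need to be fetched from the object store.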

The roadblock I'm running into is getting access to values _per row group_ like 
bloom filters do.

Given that I have a DataFrame that I'm going to write out, as I see it my 
options are:
- Write it out, then use the file metadata to read it back one row group at a 
time so I can get the values per row group. I should be able to do this as is, 
but it's gross and won't perform well, even if I already have the data in 
memory.
- Change the writer APIs to allow hooking into the spot where bloom filters 
are updated, so that a callback is invoked there and I can accumulate the 
values myself. This would be the most efficient, but it requires API design 
and changes to DataFusion.
- Add a feature to inject the row group each row came from when reading a file 
(I'd still need to read it back, and it requires changes to DataFusion, but at 
least it would be simple on my end).
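
Whichever option supplies the row group number, the bookkeeping on top of it 
is small. The sketch below stands in for that missing piece by assuming the 
writer keeps input order and starts a new row group every 
`max_row_group_size` rows. That assumption may well not hold (for example with 
parallel writes), so the second helper compares it against the row counts 
reported in the file metadata. Both function names are hypothetical.

```python
def values_per_row_group(values, max_row_group_size):
    """Distinct values per row group, ASSUMING the writer preserves input
    order and cuts a new row group every max_row_group_size rows."""
    groups = []
    for start in range(0, len(values), max_row_group_size):
        groups.append(set(values[start:start + max_row_group_size]))
    return groups


def matches_metadata(num_values, max_row_group_size, row_counts):
    """Check the assumption against per-row-group row counts taken from
    the written file's metadata (passed in here as a plain list)."""
    expected = [
        min(max_row_group_size, num_values - start)
        for start in range(0, num_values, max_row_group_size)
    ]
    return expected == list(row_counts)


column = ["a", "b", "c", "a", "d"]
groups = values_per_row_group(column, max_row_group_size=2)
```

If `matches_metadata` returns `False` for a written file, the in-memory 
grouping cannot be trusted and the values would have to come from one of the 
options above.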

Any suggestions on APIs or hook points I may be overlooking?

Thanks!

GitHub link: https://github.com/apache/datafusion/discussions/12498

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: 
[email protected]

