Hi,

Ryan Prior <rpr...@protonmail.com> wrote:

> On Friday, January 21st, 2022 at 9:03 AM, Ludovic Courtès <l...@gnu.org> wrote:
>
>> The database for 18K packages is quite big:
>>
>> --8<---------------cut here---------------start------------->8---
>> $ du -h /tmp/db*
>> 389M /tmp/db
>> 82M /tmp/db.gz
>> 61M /tmp/db.zst
>> --8<---------------cut here---------------end--------------->8---
>>
>> [snip]
>>
>> In terms of privacy, I think it’s better if we can avoid making
>> one request per file searched for. Off-line operation would be
>> sweet, and it comes with responsiveness; fast off-line search is
>> necessary for things like ‘command-not-found’ (where the shell
>> tells you what package to install when a command is not found).
>
> Offline operation is crucial, and I don't think it's desirable to download
> tens or hundreds of megabytes. What about creating & distributing a bloom
> filter per package, with members being file names? This would allow us to
> dramatically reduce the size of data we distribute, at the cost of not giving
> 100% reliable answers. We've established, though, that some information is
> better than none, and the uncertainty can be resolved by querying a web
> service or building the package locally and searching its directory.

My understanding is that Bloom filters are sets essentially, but here we
need more than that: we need to map files to package names.

Or am I misunderstanding what you have in mind?

Thanks,
Ludo’.
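
One possible reading of the per-package proposal, as a rough sketch only: keep one Bloom filter per package and recover the file-to-package mapping by testing a file name against every package's filter. The names below (`BloomFilter`, `build_index`, `candidate_packages`), the filter size, the hash scheme, and the sample data are all hypothetical and not taken from any existing Guix code.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: false positives possible, no false negatives."""

    def __init__(self, size_bits=1024, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def build_index(package_files):
    """Map each package name to a Bloom filter of its file names."""
    index = {}
    for package, files in package_files.items():
        bloom = BloomFilter()
        for name in files:
            bloom.add(name)
        index[package] = bloom
    return index


def candidate_packages(index, file_name):
    """Return packages whose filter may contain file_name."""
    return sorted(pkg for pkg, bloom in index.items() if file_name in bloom)


# Hypothetical sample data, not taken from the real database.
index = build_index({
    "coreutils": ["bin/ls", "bin/cat"],
    "grep": ["bin/grep"],
})
print(candidate_packages(index, "bin/grep"))
```

With this layout the set-only nature of each filter is worked around by the outer dictionary keyed on package name. A lookup costs one membership test per package, and the result is a list of candidates that can include false positives, which matches the "not 100% reliable" trade-off described in the quoted message.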