[frameworks-baloo] [Bug 402154] Baloo reindexes everything after every reboot

Kai Krakow Sun, 02 May 2021 04:23:05 -0700

https://bugs.kde.org/show_bug.cgi?id=402154


--- Comment #22 from Kai Krakow <k...@kaishome.de> ---
(In reply to tagwerk19 from comment #20)
> (In reply to Kai Krakow from comment #18)
> > ... I suggest to
> > read that entirely to understand the problem ...
> I've done my best :-) Thank you for the info!
> 
> In:
> 
>     https://bugs.kde.org/show_bug.cgi?id=404057#c35
> 
> You have the idea of an "Index per Filesystem" but then the idea seems

I didn't... I explained why that would not work.

> to have been put to the side. You mention "storage path" as a problem? Would
> the way "local wastebaskets" are managed on mounted filesystems be a model?
> They have to deal with the same issues as you've listed.

The problem is that you would have to deal with proper synchronization when
multiple databases are used. That is not just "find a writable storage
location and register this location somewhere". Also, you would need to have
all these different DBs open at the same time, and LMDB is a memory-mapped
database with random access patterns. So you'd multiply the memory pressure
with each location, and that will dominate the filesystem cache.

>     https://phabricator.kde.org/T9805 

This mentions "store an identifier per tracked device, e.g the filesystem UUID"
which is probably my idea. Instead of using dev_id directly, the database
should have a lookup table where filesystem UUIDs are stored as a simple list.
The index of this list can be used as the new dev_id for the other tables.
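
A minimal sketch of such a lookup table, in Python for brevity (Baloo itself
is C++). The class and method names are hypothetical, the UUID strings are
made up, and a real implementation would persist the list inside the database
instead of keeping it in memory:

```python
class FilesystemTable:
    """Maps filesystem UUIDs to small integer IDs (hypothetical sketch)."""

    def __init__(self, known_uuids=None):
        # Stand-in for a table stored in the index database.
        self._uuids = list(known_uuids or [])

    def id_for(self, fs_uuid):
        """Return the list index of fs_uuid, appending it on first sight."""
        try:
            return self._uuids.index(fs_uuid)
        except ValueError:
            self._uuids.append(fs_uuid)
            return len(self._uuids) - 1

    def uuid_for(self, fs_id):
        """Reverse lookup: which filesystem does this ID refer to?"""
        return self._uuids[fs_id]


table = FilesystemTable()
root_id = table.id_for("made-up-uuid-root")   # first filesystem seen
home_id = table.id_for("made-up-uuid-home")   # second filesystem seen
# The same UUID maps to the same ID on every later lookup, regardless of
# which dev_id the kernel assigned to the filesystem on this boot.
assert table.id_for("made-up-uuid-root") == root_id
```

Entries are only ever appended, so an ID that has been handed out keeps
pointing at the same filesystem.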

> Has a mention of "... inside encrypted containers", see this also in Bug
> 390830.

Encrypted containers should never be indexed in a global database as that would
leak information from the encrypted container. The easiest solution would be to
just not index encrypted containers unless the database itself is stored in an
encrypted container - but that's also just a band-aid. Maybe encrypted
containers should not be indexed at all. Putting LMDB on an encrypted container
may have very bad side effects on performance.

> As background thoughts...
> 
>     Things like "Tags:" folders in Dolphin and incremental searches
>     when you type into Krunner depend on baloosearch being lightning fast.

Having multiple databases per filesystem can only make this slower by
definition because you'd need to query multiple databases. From my personal
experience with fulltext search engines (ElasticSearch) I can only tell you
that querying indexes and recombining results properly is a huge pita, and it's
going to slow things way down. So the multiple database idea is probably a dead
end.
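
To illustrate the recombination step: a rough Python sketch, assuming each
per-filesystem index returns (score, path) hits already sorted by descending
score. The function name and the sample data are hypothetical, and the sketch
assumes scores from different indexes are directly comparable, which in
practice they usually are not.

```python
import heapq
from itertools import islice


def merge_results(per_index_results, limit):
    """Merge per-index hit lists, each sorted by descending score."""
    merged = heapq.merge(
        *per_index_results, key=lambda hit: hit[0], reverse=True
    )
    return [path for _score, path in islice(merged, limit)]


home_hits = [(0.9, "/home/u/a.txt"), (0.4, "/home/u/b.txt")]
usb_hits = [(0.7, "/media/usb/c.txt")]
print(merge_results([home_hits, usb_hits], limit=2))
# prints ['/home/u/a.txt', '/media/usb/c.txt']
```

Even in this simplified form, every query has to wait for the slowest index
before the top results are known.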

>     It would be a shame to lose the ability to search for phrases as in
>         baloosearch Hello_Penguin
>     as opposed to
>         baloosearch "Hello Penguin"
> 
>     I'm guessing BTRFS usage is going to grow.

The point is: Neither Linux nor POSIX states anywhere that a dev_id from stat()
is stable across reboots or remounts. This is even less true for inode numbers
on some remote filesystems or non-inode filesystems (where inode numbers are
virtual and may be allocated from some runtime state). Those are not stable
ids. At least for native Linux filesystems we can expect inode numbers to be
stable as those are stored inside the FS itself (the dev_id isn't, but the
UUID is).
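
For reference, these are the two values in question as seen from Python
(st_dev and st_ino are the stat() fields; the helper name is made up for this
sketch):

```python
import os


def file_identity(path):
    """Return the (dev_id, inode) pair that stat() reports for path."""
    st = os.stat(path)
    return st.st_dev, st.st_ino


dev_id, inode = file_identity(".")
# dev_id is assigned by the kernel at mount time and may differ after a
# reboot or remount; on native Linux filesystems the inode comes from
# the filesystem itself.
print(f"dev_id={dev_id} inode={inode}")
```

Running this before and after a reboot on an affected setup should show the
same inode but possibly a different dev_id.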

On a side note: In this context it would make sense to provide Baloo as a
system-wide storage and query service shared by multiple users, with an indexer
running per user (to index encrypted containers). It's the only way to support
these ideas:

- safe access to encrypted containers
- the database can be isolated from being readable by users (prevents
  information leakage)
- solves the problem of multiple users indexing the same data multiple times
- has capabilities to properly read UUIDs from filesystems/subvolumes (some
  FS only allow this for root)
- can guard/filter which results are returned to users (by respecting FS ACLs
  and permission bits)
- shared locations (e.g. /usr/share/docs) would be indexed just once

On the downside:

- needs some sort of synchronization between multiple indexers (to avoid
  race conditions where multiple indexers read and index the same files
  twice); this could be solved by running the indexer within the system-wide
  service, too, but access to encrypted containers needs to be evaluated
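
One possible shape for that synchronization, sketched in Python under the
assumption that the system-wide service keeps a registry of files already
claimed for indexing. The class name, the method name and the
(fs_uuid, inode) key are all hypothetical, and real per-user indexers would
talk to the service over some IPC mechanism instead of sharing a lock in one
process:

```python
import threading


class IndexClaims:
    """Tracks which files an indexer has already claimed (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = set()

    def try_claim(self, fs_uuid, inode):
        """Return True if the caller may index this file, False if taken."""
        key = (fs_uuid, inode)
        with self._lock:
            if key in self._claimed:
                return False
            self._claimed.add(key)
            return True


claims = IndexClaims()
first = claims.try_claim("made-up-uuid", 1234)    # first indexer wins
second = claims.try_claim("made-up-uuid", 1234)   # second indexer skips
print(first, second)
# prints True False
```

Checking and adding under the same lock is what closes the race between two
indexers that discover the same file at the same time.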

