Hi,

On Mon, 05 Oct 2020 at 20:53, Pierre Neidhardt <m...@ambrevar.xyz> wrote:
> - Textual database: slow and not lighter than SQLite.  Not worth it I
>   believe.

Maybe I am out of scope but, re-reading *all* the discussion about
“filesearch”, is it possible to really do better than “locate”, as
Ricardo mentioned?

--8<---------------cut here---------------start------------->8---
$ echo 3 > /proc/sys/vm/drop_caches
$ time updatedb --output=/tmp/store.db --database-root=/gnu/store/

real    0m19.903s
user    0m1.549s
sys     0m4.500s

$ du -sh /gnu/store /tmp/store.db
30G     /gnu/store
56M     /tmp/store.db

$ guix gc -F XXG
$ echo 3 > /proc/sys/vm/drop_caches
$ time updatedb --output=/tmp/store.db --database-root=/gnu/store/

real    0m10.105s
user    0m0.865s
sys     0m2.020s

$ du -sh /gnu/store /tmp/store.db
28G     /gnu/store
52M     /tmp/store.db
--8<---------------cut here---------------end--------------->8---

And then “locate” supports regexps and it is fast enough:

--8<---------------cut here---------------start------------->8---
$ echo 3 > /proc/sys/vm/drop_caches
$ time locate -d /tmp/store.db --regex "emacs-ma[a-z0-9\.\-]+\/[^.]+.el$" | tail -n5
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-transient.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-utils.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-wip.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-worktree.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit.el

real    0m3.601s
user    0m3.528s
sys     0m0.061s
--8<---------------cut here---------------end--------------->8---

My only reservation is that regexps are always cumbersome for me.
Well:

  «Some people, when confronted with a problem, think "I know, I'll use
  regular expressions."  Now they have two problems.» :-) [1]

[1] https://en.wikiquote.org/wiki/Jamie_Zawinski
> - Include synopsis and descriptions.  Maybe we should include all
>   fields that are searched by `guix search`.  This incurs a cost on
>   the database size but it would fix the `guix search` speed issue.
>   Size increases by some 10 MiB.

From my point of view, yes.  Somehow “filesearch” is a subpart of
“search”, so it should share the same machinery.

> I say we go with SQLite full-text search for now with all package
> details.  Switching to without full-text search is just a matter of a
> minor adjustment, which we can decide later when merging the final
> patch.  Same if we decide not to include the description, synopsis,
> etc.

[...]

> - Populate the database on demand, either after a `guix build` or
>   from a `guix filesearch...`.  This is important so that `guix
>   filesearch` works on packages built locally.  If `guix build`, I
>   need help to know where to plug it in.

[...]

> - Sync the databases from the substitute server to the client when
>   running `guix filesearch`.  For this I suggest we send the
>   compressed database corresponding to a guix generation over the
>   network (around 10 MiB).  Not sure sending just the delta is worth
>   it.

From my point of view, how to transfer the database from the substitute
server to users, and how to update it locally (custom channels or a
custom load path), are not easy; maybe these are the core issues.

For example, I just did “guix pull”, and “--list-generations” says that
going from f6dfe42 (Sept. 15) to 4ec2190 (Oct. 10):

  39.9 MB will be downloaded

plus the tiny bits fetched before “Computing Guix derivation”.  Say
50 MB max.  Well, the “locate” database for my /gnu/store (~30 GB) is
already ~50 MB, and ~20 MB when compressed with gzip.  And Pierre said:

  The database with all package descriptions and synopses is 46 MiB and
  compresses down to 11 MiB with zstd.

which is better, but still something.  So, IMHO, it is not affordable
to fetch the database with “guix pull”.  Therefore the database would
be fetched at the first “guix search” (assuming the point above).  But
then, how could “search” know what is custom-built and what is not?
Somehow, “search” would have to scan the whole store to be able to
update the database.  And what happens each time I do a custom build
and then run “filesearch”?  The database should be updated, right?
Then it seems almost unusable.

The “updatedb/locate” model seems better: the user updates the database
“manually” when required, and then locating files is fast.  In most
cases I am searching for files in packages that are not my custom
packages, IMHO.

To me, each time I use “filesearch”:

 - the first time: fetch the database corresponding to the Guix commit,
   then update it with my local store;
 - otherwise: use this database;
 - optionally: update the database if the user wants to include new
   custom items.

We could imagine a hook or an option to “guix pull” specifying to also
fetch the database and update it at pull time instead of at “search”
time.  Personally, I would prefer a longer “guix pull” (it is already a
bit long) and then a fast “search”, rather than half/half (a
not-so-long pull and a longer search).  WDYT?

> - Find a way to garbage-collect the database(s).  My intuition is
>   that we should have 1 database per Guix checkout and when we `guix
>   gc` a Guix checkout we collect the corresponding database.

Well, the exact same strategy as
~/.config/guix/current/lib/guix/package.cache can be used.

BTW, thanks Pierre for improving the Guix discoverability. :-)

Cheers,
simon