On 2013-03-26 01:20, Nick Black wrote:
> Stefan Fritsch left as an exercise for the reader:
>> Well, the contents files are much larger than the package files and
>> are usually used less frequently. So some users may prefer to download
>> the contents files only when necessary. Apart from that, I don't see
>
> then can't they just leave apt-file uninstalled? especially as installing
> it would perform the initial apt-get update?
>
>> any problem. But that's not my decision anymore :-)
>
> yeah i'm not wedded to any particular solution, but this one seems right
> to me. if it's something that's been threaded out at length, though, no
> need to entertain my suggestions.

Sounds like we should have some config variable that people can set (or
clear) to disable Contents fetching via apt-get update. Assuming the APT
side of this would support such a use case, I think we can have it.

But to be honest, I would really like to remove "apt-file update" if I
can get away with it. It always seemed like something APT ought to do...
though I suppose that if I end up delegating the entire thing to apt-get
update, it will not really have any maintenance overhead.

>> - Significant speedup could be attained by recompressing the local
>> file with lzop instead of gzip. You write "processing time is roughly
>

I tried a bit with lzop, and indeed it seems to halve my runtimes with
search and show (at least the non-regex variants). It does come at the
cost of lower compression ratios, which is not a problem at the moment
but might become one when multi-arch support is "added" (also see my
comment about redundancy below).
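For reference, this is roughly how I measured it; a crude sketch rather
than a proper benchmark, with made-up paths and run counts, assuming the
gzip and lzop command-line tools plus a Contents file recompressed as
.lzo:

    #!/usr/bin/env python3
    # Crude decompression timing: gzip vs. lzop on a Contents file.
    # Paths and RUNS are hypothetical; assumes the gzip and lzop CLIs
    # are installed and Contents-amd64 was also recompressed with lzop.
    import subprocess
    import time

    RUNS = 5
    CANDIDATES = [
        ("gzip", ["gzip", "-dc", "Contents-amd64.gz"]),
        ("lzop", ["lzop", "-dc", "Contents-amd64.lzo"]),
    ]

    for name, cmd in CANDIDATES:
        start = time.perf_counter()
        for _ in range(RUNS):  # average a few runs to smooth out noise
            subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        elapsed = (time.perf_counter() - start) / RUNS
        print(f"{name}: {elapsed:.2f}s per decompression")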
> Absolutely. If we can make local changes, there's all kinds of things we
> can do. I left this entire class of optimizations aside.
>
> For that matter, if we stripped the free-form header section, that would,
> perhaps surprisingly, *vastly* simplify my code. Again, I wanted to do an
> implementation which conformed precisely to current disk layouts and such,
> since I want to deploy this in SprezzOS independently of Debian's decision.

There are also things we could do at update time:

 * Pre-append / to all paths, since people expect there to be a leading
   slash. To this end, apt-file currently tries to rewrite people's
   search patterns to match reality, but I hope we can eventually avoid
   that (because it does not work in all cases etc.).

 * Remove redundancy between Contents-* files. Between unstable and
   testing (or i386 and amd64) there is a huge overlap in files, so this
   would likely let us scale better as the number of enabled
   architectures and distributions increases. (Related bugs include
   #658794, #578727 and #632254.)

 * Make optimized caches for certain use cases like "list/show". Maybe
   even "match pattern X against programs in the default PATH".

The second item probably requires merging the Contents files, which we
would need to do very efficiently. I believe the files are pre-sorted,
so we could exploit this to do the "merge" part of mergesort without
having the whole ordeal loaded in memory (which is sadly quickly
measured in GB).
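Something like this minimal sketch is what I have in mind (Python for
illustration; the filenames are hypothetical, and the free-form header
section is ignored for brevity):

    #!/usr/bin/env python3
    # Stream several pre-sorted Contents files and drop duplicate lines
    # without ever holding a whole file in memory.
    import gzip
    import heapq

    INPUTS = ["Contents-i386.gz", "Contents-amd64.gz"]  # assumed sorted

    def lines(path):
        with gzip.open(path, "rt", encoding="utf-8",
                       errors="replace") as fh:
            yield from fh

    previous = None
    with open("Contents-merged", "w", encoding="utf-8") as out:
        # heapq.merge consumes its inputs lazily, so peak memory stays
        # at roughly one line per input file instead of gigabytes.
        for line in heapq.merge(*(lines(p) for p in INPUTS)):
            if line != previous:  # duplicates are adjacent when sorted
                out.write(line)
            previous = line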
>> - Try benchmarks with a single core, too. It's nice if you can use
>> more cores but you don't want to have too much regression on single
>> core systems.
>
> Yep, I will send those around. I'm not doing anything stupid like
> launching a fixed-size pool; libblossom offers us per-CPU workers etc.
>
>> - apt-file regex search performance sucks because it doesn't use grep.
>> Nowadays grep has -P, so grep could be used, too. Which regex type do
>> you use?

Also possibly because Perl (like Python, Java etc.) uses an expensive
regular expression implementation [1].

> Hold on for a bit of theory:
>
> - I'm matching multiple patterns using an "Advanced Aho-Corasick
>   automaton". The set of all AACAs is clearly a subset of all DFAs
>   (discrete finite automatons).

I think you mean s/discrete/deterministic/, as an NFA (which can be used
to match any regular language as well) is a "non-deterministic finite
automaton".

> - The set of languages recognized by DFAs is equivalent to the set of
>   regular languages.
>
> - Thus, any regular operation can be encoded into a DFA, though
>   possibly at a high cost in states. See Sipser or Hopcroft+Ullman.
>
> - Thus, we just encode the regular operations as alternates in our
>   AACA. Since we already match the AACA in time independent of the
>   number of patterns, adding these alternate patterns costs us no time
>   in the main, but only in the preprocessing.
>
> I'm doing basically the exact same thing grep/egrep does:
> Boyer-Moore-Galil for one pattern, or AAC for multiple patterns.
>
>> [...]
>
> Hack on!
>
> --nick

True, but the Perl "regular expression" is in fact more powerful than an
NFA. Admittedly, I believe the only real feature that exceeds NFAs is
backrefs, which are thankfully not used that often. I have no concerns
about compiling the "Perl regex" case into a DFA/NFA where possible, but
we have to either handle the backref case or explicitly document that
backrefs are not supported (see the sketch at the end of this mail).

I am planning on doing an apt-file 3 release post-Wheezy where I permit
backwards-incompatible changes (e.g. the exit codes, and making -F the
default with show/list), so either way we choose to do it will be fine.

~Niels

[1] http://swtch.com/~rsc/regexp/regexp1.html
    The article is from 2007, so things could have changed, though it is
    my understanding that they haven't. NB: the first two graphs do not
    use the same unit on the Y-axis (seconds vs. microseconds).
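PS: A minimal illustration of the backref point above, using Python's re
module for brevity (apt-file itself is Perl, but \1 behaves the same
way):

    #!/usr/bin/env python3
    # Backrefs exceed DFA/NFA power: (\w+)/\1 matches "a word, then the
    # same word again", and { w/w : w a word } is a classic non-regular
    # language, so no finite automaton can recognize it exactly.
    import re

    pattern = re.compile(r"^(\w+)/\1$")

    print(bool(pattern.match("zsh/zsh")))   # True:  component repeated
    print(bool(pattern.match("zsh/bash")))  # False: components differ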