Stefan Fritsch left as an exercise for the reader: > Well, the contents files are much larger than the package files and > are usually used less frequently. So some users may prefer to download > the contents files only when necessary. Apart from that, I don't see
then can't they just leave apt-file uninstalled? especially as installing it would perform the initial apt-get update? > any problem. But that's not my decision anymore :-) yeah i'm not wedded to any particular solution, but this one seems right to me. if it's something that's been thraded out at length, though, no need to entertain my suggestions. > - Significant speedup could be attained by recompressing the local > file with lzop instead of gzip. You write "processing time is roughly Absolutely. If we can make local changes, there's all kinds of things we can do. I left this entire class of optimizations aside. For that matter, if we stripped the free-form header section, that would, perhaps surprisingly, *vastly* simplify my code. Again, I wanted to do an implementation which conformed precisely to current disk layouts and such, since I want to deploy this in SprezzOS independently of Debian's decision. > - Try benchmarks with a single core, too. It's nice if you can use > more cores but you don't want to have too much regression on single > core systems. Yep, I will send those around. I'm not doing anything stupid like launching a fixed-sized pool; libblossom offers us per-CPU workers etc. > - apt-file regex search performance sucks because it doesn't use grep. > Nowadays grep has -P, so grep could be used, too. Which regex type do > you use? Hold on for a bit of theory: - I'm matching multiple patterns using an "Advanced Aho-Corasick automaton". The set of all AACAs is clearly a subset of all DFA (discrete finite automatons). - The set of languages recognized by DFAs is equivalent to the set of languages recognized by regular languages. - This, any regular operation can be encoded into a DFA, though possibly at a high cost in states. See Sipser or Hopcroft+Ullman. - Thus, we just encode the regular operations as alternates in our AACA. Since we already match the AACA in time independent of the number of patterns, adding these alternate patterns costs us no time in the main, but only in the preprocessing. I'm doing basically the exact same thing grep/egrep does: Boyer-Moore-Galil for one pattern, or AAC for multiple patterns. > - Are you limiting the used memory? Remember there may still be VMs > with 256MB RAM and you shouldn't cause swapping on such systems. Even if they only have 256MB of physical RAM, they still have however large a virtual address space. mmap() is not going to map in all the requested pages; it's just associating a VMA in our process with them. They'll be faulted in as they're referenced. Thus it's not how much we have mmap()d (which can be large, equivalent to the sum of the compressed files plus dynamic state). Our dynamic state per thread is limited, however, so we'll never be say trying to uncompress 256MB of text (this would be bad for parallelism anyway). Hope that answers these questions. Feel free to hit me with more. Hack on! --nick -- nick black http://www.sprezzatech.com -- unix and hpc consulting to make an apple pie from scratch, you need first invent a universe. -- To UNSUBSCRIBE, email to debian-wnpp-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20130326002009.ga12...@qemfd.net