nick black left as an exercise for the reader:
> I've written up a performance characterization of apt-file(1) as it stands.
> The preliminary version is available here:
>
> http://nick-black.com/raptorial-file
OK, finished this writeup. I've updated the file at the link above, and
included it below. I'm going to go ahead and knock out bugs 705[0] and
698[1], which will together implement everything described herein.
Hopefully that'll be done by tomorrow morningish, EDT. Hack on!

--rigorously, nick

[0] https://www.sprezzatech.com/bugs/show_bug.cgi?id=705
[1] https://www.sprezzatech.com/bugs/show_bug.cgi?id=698

Optimizing apt-file
===================

All timings were taken on a quad-core Core i7-2600K overclocked to 5.0GHz,
reading files from an Intel 320 SSD. All data sets fit within main memory.

apt-file has a more complex performance space than did apt-show-versions.
Understanding this space is critical to optimizing apt-file, and to
evaluating any optimizations made. apt-file doesn't perform string searches
itself; it uses grep, zgrep and zfgrep. Furthermore, while apt-file is not
multithreaded, it does run multiple processes (apt-file -> gzip | grep), and
thus natively takes advantage of multiple CPUs within a given Contents file.
In this top(1) output, the last column is the CPU number:

 8529 dank      7952  848  688 R 21.9  0:00.66 grep      3
 8522 dank     66508  17m 3436 S 11.6  0:00.35 apt-file  0
 8528 dank      8840  668  508 R 10.6  0:00.32 gzip      2

Of course, on a unicore machine, this use of three different processes will
provide no computational benefit (it might provide I/O benefit).

It is worth noting that the -x (regex) functionality of apt-file appears to
be broken, or at least incomplete. Neither of the following searches, for
instance, yields any matches for "compiz" (or, indeed, all of the matches
for "frog"):

[skynet](0) $ apt-file search -x frog\|compiz
frog: /usr/bin/frog
wmfrog: /usr/bin/wmfrog
[skynet](0) $ apt-file search -x \(frog\)\|\(compiz\)
frog: /usr/bin/frog
wmfrog: /usr/bin/wmfrog
[skynet](0) $

Running "apt-file frog compiz" finds all of the frog matches, but none of
the compiz matches. It appears that apt-file cannot use multiple patterns
from the command line, but it doesn't bother to warn the user about that.
We can create a pattern file, however, and that works:

[skynet](0) $ echo -e "frog\ncompiz" > searches
[skynet](0) $ apt-file -f search searches | wc -l
3874
[skynet](0) $ apt-file search compiz | wc -l
1112
[skynet](0) $ apt-file search frog | wc -l
2762
[skynet](0) $ echo $((3874 - 1112))
2762
[skynet](0) $

Characterizing apt-file performance
===================================

Background material:
  "Flexible Pattern Matching in Strings" by Navarro et al.
  "Introduction to Algorithms" by Cormen et al.

String search algorithms tend to be optimized for one use or another.
Searching within DNA, for instance, is a very different game than searching
within UTF-32, which is in turn very different from searching within UTF-8.
We will consider only exact string matching, and not "fuzzy" string matching
(i.e., matching within some string distance). Parameterizing the problem
space under this condition yields:

- alphabet size
- needle length
- haystack length
- number of needles
- needle heterogeneity
- frequency of occurrence

Note that in the presence of Kleene closure, there might be an infinite
number of needles. In the case of an online algorithm, the haystack is of
infinite size. All other parameters are finite. The alphabet size and
haystack length are independent of apt-file(1).
Let's explore performance when we vary the other parameters:

Varying needle length
=====================

It would be nice to have short needles of various lengths which held the
frequency of occurrence constant, but that'll have to come another day.
It'll be easy, as we'll see, to vary frequency without varying length.

[skynet](0) $ for i in compizabd compizab compiza compiz compi comp com co c ; do time apt-file search $i | wc -l ; done
0
real    0m1.655s
0
real    0m1.673s
0
real    0m1.654s
1112
real    0m1.854s
17293
real    0m2.850s
86277
real    0m4.971s
225415
real    0m7.667s
998349
real    0m22.491s
4019431
real    1m30.639s
[skynet](0) $

apt-file(1) clearly performs better for longer strings, although this could
be due to their lower frequency of occurrence. Times are approximately equal
for the three longest strings, but they also have equal frequency
(0 occurrences). This points to a backwards-skipping algorithm such as
Boyer-Moore: with longer strings, we get longer skips, and the ratio of
skipped text for different string lengths (N, N-1) converges towards 1 as N
grows large (i.e., there's a much bigger difference between skipping 1 vs 2
than skipping 10 vs 11). This could all be due to I/O, though. To check, we
comment out the print command in the innermost loop of print_winners(); I'll
spare you the block of timings and just assert that the absolute timings
fell, but not by much. It does indeed appear that a backwards-skipping
algorithm is in use.

Needle heterogeneity
====================

We'll hold length and frequency constant, and vary the homogeneity:

[skynet](0) $ for i in abcd aaaa aaaa abcd ; do time apt-file search $i | wc -l ; done
113
real    0m1.739s
user    0m2.047s
sys     0m0.163s
122
real    0m1.827s
user    0m2.227s
sys     0m0.117s
122
real    0m1.813s
user    0m2.157s
sys     0m0.167s
113
real    0m1.715s
user    0m2.003s
sys     0m0.167s
[skynet](0) $

The slightly longer times on "aaaa" vs "abcd", despite nearly equal
frequency, suggest that a backwards-skipping rather than a forwards-skipping
algorithm is in use, as a homogeneous needle will affect the former more
(questionable!).

Number of needles
=================

Frequency of occurrence
=======================

We'll hold length constant, and vary the number of matches:

[skynet](0) $ for i in th gh es zz ; do time apt-file search $i | wc -l ; done
590993
real    0m14.883s
user    0m20.290s
sys     0m0.953s
186325
real    0m7.139s
user    0m10.440s
sys     0m0.453s
1626325
real    0m36.047s
user    0m43.220s
sys     0m2.497s
19455
real    0m2.698s
user    0m3.927s
sys     0m0.200s
[skynet](0) $

All else being equal, we'd expect more matches to be faster, as that would
allow more text to be skipped (we can stop matching a given haystack as soon
as we find a match). That this is not the case suggests that the
postprocessing step of apt-file(1) (a perl "uniq sort") probably has crap
running time. Once again, I tested this with no output, and got very similar
numbers.

Speeding it up
==============

First, a constraint: we're going to use unmodified zlib to inflate the
files, which precludes matching as we inflate (there is no way to insert
such a function into zlib 1.2.7). There will thus necessarily be a strong
dependency between chunk inflation and chunk processing. This has two
effects, both negative: we can't process a chunk in parallel with that
chunk's inflation, and we are likely to suffer both a write-miss (during
inflation) and a read-miss (during processing), rather than merely a
write-miss, or possibly no cache miss at all for non-matching text (if we
inflated and processed non-matches wholly within registers).
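To make that dependency concrete, here's a minimal sketch of the kind of
chunked inflation loop I have in mind (the function name, chunk size, and
match() callback are illustrative assumptions, not raptorial's actual
interfaces). Each buffer must be completely written by inflate() before the
matcher reads it, and real code must additionally carry over each chunk's
tail, since a needle can straddle a chunk boundary.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK (64 * 1024)

/* Inflate one gzipped Contents file chunk-by-chunk, handing each inflated
 * buffer to a (hypothetical) matcher. Matching cannot begin until inflate()
 * has finished writing the chunk -- the dependency described above. */
static int
scan_contents(FILE *fp, void (*match)(const unsigned char *, size_t)){
	unsigned char in[CHUNK], out[CHUNK];
	z_stream zs;
	int ret = Z_OK;

	memset(&zs, 0, sizeof(zs));
	if(inflateInit2(&zs, 47) != Z_OK){ /* 47 == 15 + 32: zlib or gzip header */
		return -1;
	}
	do{
		zs.avail_in = fread(in, 1, sizeof(in), fp);
		if(ferror(fp) || zs.avail_in == 0){
			break;
		}
		zs.next_in = in;
		do{ /* one read of compressed input can yield several output chunks */
			zs.avail_out = sizeof(out);
			zs.next_out = out;
			ret = inflate(&zs, Z_NO_FLUSH);
			if(ret != Z_OK && ret != Z_STREAM_END){
				inflateEnd(&zs);
				return -1;
			}
			/* write-miss above (inflate), read-miss here (match) */
			match(out, sizeof(out) - zs.avail_out);
		}while(zs.avail_out == 0);
	}while(ret != Z_STREAM_END);
	inflateEnd(&zs);
	return ret == Z_STREAM_END ? 0 : -1;
}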
Let's first look at how best to parallelize this, as that tends to affect
program organization overall, forcing shape on the other code. There is one
Contents file per |repository X architecture| (the "source"
pseudo-architecture does not currently provide a Contents file). Most
installations are likely to have between one and three Contents files
("main", "contrib", and "non-free" for some primary repository), where one
of these files is very much larger than the other two. As an example:

[skynet](0) $ ls -l /var/cache/apt/apt-file/
   82015 ftp.us.debian.org_debian_dists_unstable_contrib_Contents-amd64.gz
25868596 ftp.us.debian.org_debian_dists_unstable_main_Contents-amd64.gz
  794733 ftp.us.debian.org_debian_dists_unstable_non-free_Contents-amd64.gz
    1330 www.sprezzatech.com_apt_dists_unstable_contrib_Contents-amd64.gz
 4551278 www.sprezzatech.com_apt_dists_unstable_main_Contents-amd64.gz
   28211 www.sprezzatech.com_apt_dists_unstable_non-free_Contents-amd64.gz
[skynet](0) $

Ratios of length between successive files, sorted by size, are: 5.68, 5.73,
9.69, 2.91, 21.21. Leaving out the second set of repositories, they're 32.55
and 9.69 for Debian, and 161.33 and 21.21 for SprezzOS. Either way, there's
about three orders of magnitude difference between the smallest and the
largest. Remember that these are compressed versions of the files -- we'll
be scanning proportionately more text.

So, we can parallelize across the files, but that's of limited utility; the
largest file(s) are likely to strongly dominate total execution time.
Nonetheless, we have gone ahead and done this. It has the added benefit that
if there is a problem reading any particular file (e.g., an I/O delay due to
a bad sector), the other files can still be processed. Much more benefit can
be had by parallelizing within a file (recall that apt-file(1) already
enjoys producer:consumer parallelism within a file via its multiprocess
approach).

Remembering that we never want more threads in the running state than there
are available processors, there are three major approaches we could pursue:

(a) One thread inflates to multiple buffers in succession. Worker threads
    start in stepwise fashion as buffers become available.

(b) Worker threads lock on the file being inflated; upon taking ownership,
    they inflate a chunk to their own buffer, and work on it.

(c) Worker threads statically chunk up the file to be inflated, then seek
    out the first zlib breakpoint following the start of their chunk. This
    is similar to how apt-show-versions works, since it doesn't know stanza
    boundaries ahead of time.

Options (a) and (b) are very similar -- they are equivalent for small files
("small" here meaning less than the product of chunk size and processor
count). The difference is that when all worker threads are occupied in (a),
the inflating thread continues to run, whereas inflation pauses when all
worker threads are running in (b). A fundamental advantage of (b) is that it
uses a fixed amount of memory, small relative to large files, whereas (a)
can require up to the full inflated size (this occurs if the inflating
thread runs ahead of the worker threads). Option (b) has threads waiting
whenever they finish at the same time. Option (a) has threads waiting
whenever any thread finishes before the inflation thread has made a chunk
available. It seems clear that (b) is the better choice of the two.
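To make option (b)'s locking discipline concrete, here's a minimal sketch
(the type names, chunk size, and per-file mutex are illustrative
assumptions, not raptorial's actual interfaces): each worker holds the
stream lock only long enough to inflate one chunk into its private buffer,
and searches that buffer with the lock released, so another worker can be
inflating the next chunk concurrently.

#include <pthread.h>
#include <stdbool.h>
#include <zlib.h>

#define CHUNK (64 * 1024)

/* Shared, mutex-protected view of the file currently being inflated. */
struct gzsource {
	pthread_mutex_t lock;
	gzFile gz;	/* opened with gzopen() by the launching thread */
	bool done;	/* set once gzread() hits EOF or errors */
};

/* Hypothetical per-chunk matcher; in the real program, the backwards-skipping
 * search plus per-thread result accumulation. */
void match_chunk(const char *buf, int len);

static void *
worker(void *vsrc){
	struct gzsource *src = vsrc;
	char buf[CHUNK];

	for(;;){
		int len;

		pthread_mutex_lock(&src->lock);
		if(src->done){
			pthread_mutex_unlock(&src->lock);
			break;
		}
		len = gzread(src->gz, buf, sizeof(buf)); /* inflate one chunk */
		if(len <= 0){
			src->done = true;
			pthread_mutex_unlock(&src->lock);
			break;
		}
		pthread_mutex_unlock(&src->lock);
		match_chunk(buf, len); /* search outside the critical section */
	}
	return NULL;
}

As before, a real implementation would also have to stitch together lines
split across chunk boundaries and merge each thread's results for output.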
Option (c) is interesting. It does a bit more work than (b), but eliminates
all delays (most importantly the compulsory startup delays), provided a
friendly chunk size during deflation. Should processing time be similar to
inflation time, this could become important. Currently, processing time is
roughly characterized as at least twice inflation time, and I'm not very
knowledgeable regarding zlib chunking characteristics, so this added
complexity is not planned. It's worth considering later, though, especially
as the serial nature of inflation will become more and more apparent as
processor count increases.

Finally, we want to ensure we launch an optimal number of threads -- we
don't want to launch 8 threads on either a dual-core, power-limited tablet
or a 1024-core supercomputer. Libblossom allows us to launch a thread per
processing unit.

Rather than parallelizing directly within a directory or within a file,
then, we parallelize on a queue of file-offset objects. This queue has a
fixed upper size: one descriptor per thread (we'll never have 9 files open
if we have only 8 threads), where a descriptor is a map, a total map length,
and the offset through which processing has been scheduled. Once a map is
fully scheduled, its descriptor is removed. If there are no descriptors on
the queue, perform a readdir_r() on the DIR pointer (serialized by the
kernel), and possibly add a new descriptor to the queue (if the file
returned by readdir_r() is too large to process in one chunk). We need to
unlock before calling readdir_r(), but in this case we have to relock to add
a new queue element. We could eliminate this by always performing a
readdir_r(), and simply adding the new element if there's already an element
to work on when we acquire the queue, cutting some overhead. This would
violate our assumption that we never have more files open than we have
threads, however, meaning that we'd no longer have a fixed queue size. This
is no great loss; the fixed size was more a nice property that fell out of
our design than a real goal. We can use a dynamic array and indices rather
than pointers, or even dynamically-allocated queue nodes, without much cost.

Let's take this idea further (d): rather than adding the new descriptor to
the queue and taking the front work element, the thread could immediately
begin working on the new descriptor. Enqueue a new node if the new
descriptor is not completely processed, and repeat this process until
readdir_r() is exhausted. This implies state equal to the aggregate length
of the files, but it means that there is never a delay blocking on another
thread within a file. Of course, we achieved this in (c) anyway, at the cost
of a bit of extra work. Option (d) only improves on (c) in the fairly
unlikely event that there are more files than there are threads, burns
probably at least as much extra work with all those extraneous readdir_r()
calls, and eats up more memory when it does provide a benefit. We will go
with (c).

Within a thread, we will use Boyer-Moore-Galil, or perhaps
Apostolico-Giancarlo or Volnitsky (the cases for which AG beats BMG can be
considered pathological for apt-file(1), but hey, completion is good).
Volnitsky is Wu-Manber-ish (read: hash-based) and difficult to characterize.
BMG is well-adapted to large alphabets and low repetition of letters within
a search term, which accurately describes our search space.
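For reference, here's a minimal sketch of the bad-character skip these
backwards-skipping algorithms share. I show the simpler Boyer-Moore-Horspool
form rather than full BMG (no good-suffix table, no Galil optimization), so
take it as an illustration of the skip behavior rather than as the planned
matcher.

#include <stddef.h>
#include <string.h>

/* Horspool variant of the backwards-skipping search. Returns a pointer to
 * the first occurrence of needle within haystack, or NULL. */
static const char *
bmh_search(const char *haystack, size_t hlen, const char *needle, size_t nlen){
	size_t skip[256];
	size_t i;

	if(nlen == 0 || nlen > hlen){
		return nlen ? NULL : haystack;
	}
	/* Bad-character table: how far we may shift based on the haystack byte
	 * aligned with the needle's final position. Bytes absent from the
	 * needle permit a full nlen shift. */
	for(i = 0 ; i < 256 ; ++i){
		skip[i] = nlen;
	}
	for(i = 0 ; i + 1 < nlen ; ++i){
		skip[(unsigned char)needle[i]] = nlen - 1 - i;
	}
	for(i = 0 ; i + nlen <= hlen ; i += skip[(unsigned char)haystack[i + nlen - 1]]){
		if(memcmp(haystack + i, needle, nlen) == 0){
			return haystack + i;
		}
	}
	return NULL;
}

Bytes that never appear in the needle allow a full needle-length skip, which
is why long, heterogeneous needles over a large alphabet do so well, and why
a homogeneous needle like "aaaa" tends to skip less (an 'a' aligned with the
needle's last position limits the shift to one byte) -- consistent with the
timings above.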
Closing note
============

This has all assumed purely ASCII needles and haystacks, which seems valid
from what I've seen of the Contents files. This seems a pretty shortsighted
policy, though. If someone wants to point out chapter and verse where
packages are restricted to ASCII filenames in Debian, that'd comfort me a
bit. I'd go look it up, but I've been writing this for like four hours.

-- 
nick black     http://www.sprezzatech.com -- unix and hpc consulting
to make an apple pie from scratch, you need first invent a universe.