Hello 9fans, I have written a text database named Kirara. The following is a brief introduction to Kirara. If you are interested in, get Kirara from: http://plan9.aichi-u.ac.jp/netlib/kirara/
Kenji Arisawa ------------- Kirara ------------- Kirara is a text indexing/retrieval tool for Plan 9. Personal use: index/retrieve local files. Kirara is based on the idea similar to Glimpse. (1) indexing + grep (2) multi-level indexing (a) small space for indexing (b) small update time (c) quick search Note that: small indexing <-> quick search Kirara makes more index -> quick search Glimpse is single-level indexing. ------------- Query Kirara does not support phrase search. The database is index of words, supporting: QE mode (query expression mode) '&', '|', '*' The example: 'snoopy&html' 'snoop*&htm*' RE mode (regular expression mode) '&', RE where RE denotes regular expression. The example: 'sn.*y&h.+l' RE mode is a bit slow. (a few second.) ------------- Words Two or more runes. All words are converted to lower case. In English, words is composed of alphabets. The number of runes is configurable Assumption: Text is composed of space-separated words popular in English and many European Languages, but not in Japanese. ------------- The user's interface Best match with Rio term% kfind snoop G snoop /sys/src/9/ip/ G snoop /sys/src/cmd/spell/ G snoop /sys/src/9/kw/ ... term% G snoop /sys/src/9/ip devip.c:34: Qsnoop, devip.c:95: case Qsnoop: devip.c:98: devdir(c, q, "snoop", qlen(cv->sq), cv->owner, 0400, dp); ... Note that: two steps 1. find directories 2. find files and the contents Step 2 is actually 'grep'. we can use RE. Two-steps search is not a weekness, but a desirable feature. Because we have so many files that are hit by the query. ------------- The organization My example /n/other/kirara/sysdb target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc) /n/other/kirara/usrdb target: $home/^(bin/rc lib netlib doc adm issues srclib src sources) Indexing target is fully configurable. ------------- Multi-Level Indexing (1) Indexing (top level) word to directory mapping sysdb/index # main index # used for RE mode sysdb/mindex # meta index (alphabetic index) # used for QE mode sysdb/dind/* # rough index of each directory sysdb/QTDir # map table (QID, mtime, path-to-dir) index # word to dir QID aa 0000000000014f0a aa 000000000001a1e0 aa 000000000001a26e mindex # word to range in index aa 0 126669 ab 126669 491569 ac 491569 1258566 ad 1258566 1852467 ... dind/* # `*' is a directory QID 0000000000014f05 0000000000014f0a 000000000001a1ce usrdb is same. (2) Indexing (directory level) # optional word to file mapping sysdb/find/*/ind.gz # fine index of the directory (gzipped) sysdb/find/*/qtn # map table (QID, mtime, name) where `*' is a directory QID usrdb is same as sysdb. ------------- Experiment (a) hardware GA-H61M-USB3-B3 Intel Pentium G860 (3GHz) DDR3 PC3 4GB (b) software 9front cwfs64x ------------- The performance (compression ratio) target target num_of_dirs indexing sysdb: 556 MB 1790 dirs 49 MB usrdb: 6620 MB 8948 dirs 150 MB compression ratio: 49/556 (sysdb) note: usrdb includes many non-text file. ------------- The performance (retrieval time) system dependent RQ search # kfind foo 0.1 seconds. It is not important to make this time smaller. (sufficiently small) RE search # kfind -r foo a few seconds ------------- The performance (construction/update) (a) Construction time system dependent Initial construction need 10 minutes for sysdb 30 minutes for usrdb (b) Updating time two commands for update mkdb 20 seconds to a few minutes for usrdb depends largely on state of cache mkdb1 (currently only for usrdb) 5 to 15 seconds for usrdb mkdb1 needs event log ------------- Scalability Main factors (a) retrieval time QE search: proportional to number of dirs that include the query RE search: proportional to size of index (b) initial construction time proportional to total data (c) update time mkdb: proportional to number of dirs and the changes mkdb1: proportional to changes and size of index ------------- Used Tools (1) rc (2) grep, sed, awk, sort, diff, gzip, ... (3) some new tools written in C ------------- What Kirara means? Kirara is name of a girl that appeared in a Japanese comic book. (But I have never read the book.) The name is seldom used in real world. From the name we Japanese imagine something glittering. I like the name. ------------- References [1] GLIMPSE: A Tool to Search Through Entire File Systems Udi Manber and Sun Wu (1993) http://webglimpse.net/pubs/glimpse.pdf [2] Glimpse Documentation http://webglimpse.net/gdocs/glimpsehelp.html