Re: [9fans] text database Kirara

Francisco J Ballesteros Tue, 06 Aug 2013 00:28:13 -0700

nice. btw I think I have another tag tool at contrib. not sure, but it was 
called tags, and the search tool was called F.


I say this because it used file and per file type tag listing (ms2txt, etc) to 
index other file types. It might be an idea to add to your nice tool. 

On Aug 6, 2013, at 3:14 AM, arisawa <aris...@ar.aichi-u.ac.jp> wrote:

> Hello 9fans,
> 
> I have written a text database named Kirara.
> The following is a brief introduction to Kirara.
> If you are interested in, get Kirara from:
>  http://plan9.aichi-u.ac.jp/netlib/kirara/
> 
> Kenji Arisawa
> 
> -------------
> 
> Kirara
> 
> -------------
> 
> Kirara is a text indexing/retrieval tool for Plan 9.
> 
> Personal use: index/retrieve local files.
> 
> Kirara is based on the idea similar to Glimpse.
> 
> (1) indexing + grep
> (2) multi-level indexing
> 
> (a) small space for indexing
> (b) small update time
> (c) quick search
> 
> Note that:
> small indexing   <->    quick search
> Kirara makes more index -> quick search
> Glimpse is single-level indexing.
> 
> -------------
> 
> Query
> 
> Kirara does not support phrase search.
> The database is index of words,
> 
> supporting:
> QE mode (query expression mode)
>    '&', '|', '*'
> The example:
>    'snoopy&html'
>    'snoop*&htm*'
> 
> RE mode (regular expression mode)
>    '&', RE
> where RE denotes regular expression.
> The example:
>    'sn.*y&h.+l'
> 
> RE mode is a bit slow. (a few second.)
> 
> -------------
> 
> Words
> 
> Two or more runes.
> All words are converted to lower case.
> In English, words is composed of alphabets.
> The number of runes is configurable
> 
> Assumption:
> Text is composed of space-separated words
> popular in English and many European Languages,
> but not in Japanese.
> 
> -------------
> 
> The user's interface
> 
> Best match with Rio
>    term% kfind snoop
>    G snoop /sys/src/9/ip/
>    G snoop /sys/src/cmd/spell/
>    G snoop /sys/src/9/kw/
>    ...
>    term% G snoop /sys/src/9/ip
>    devip.c:34:    Qsnoop,
>    devip.c:95:    case Qsnoop:
>    devip.c:98:        devdir(c, q, "snoop", qlen(cv->sq), cv->owner, 0400, 
> dp);
>    ...
> Note that: two steps
> 1. find directories
> 2. find files and the contents
> Step 2 is actually 'grep'. we can use RE.
> 
> Two-steps search is not a weekness, but a desirable feature.
> Because we have so many files that are hit by the query.
> 
> -------------
> 
> The organization
> 
> My example
> 
> /n/other/kirara/sysdb
> target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)
> 
> /n/other/kirara/usrdb
> target: $home/^(bin/rc lib netlib doc adm issues srclib src sources)
> 
> Indexing target is fully configurable.
> 
> -------------
> 
> Multi-Level Indexing
> 
> (1) Indexing (top level)
> word to directory mapping
> 
> sysdb/index        # main index                    # used for RE mode
> sysdb/mindex    # meta index (alphabetic index)    # used for QE mode
> sysdb/dind/*    # rough index of each directory
> sysdb/QTDir        # map table (QID, mtime, path-to-dir)
> 
> index        # word to dir QID
>    aa    0000000000014f0a
>    aa    000000000001a1e0
>    aa    000000000001a26e
> 
> mindex    # word to range in index
>    aa 0 126669
>    ab 126669 491569
>    ac 491569 1258566
>    ad 1258566 1852467
>    ...
> dind/*                # `*' is a directory QID
>    0000000000014f05
>    0000000000014f0a
>    000000000001a1ce
> 
> usrdb is same.
> 
> (2) Indexing (directory level)    # optional
> word to file mapping
> 
> sysdb/find/*/ind.gz    # fine index of the directory (gzipped)
> sysdb/find/*/qtn    # map table (QID, mtime, name)
> where `*' is a directory QID
> 
> usrdb is same as sysdb.
> 
> -------------
> 
> Experiment
> 
> (a) hardware
> GA-H61M-USB3-B3
> Intel Pentium G860 (3GHz)
> DDR3 PC3 4GB
> 
> (b) software
> 9front
> cwfs64x
> 
> -------------
> 
> The performance (compression ratio)
> 
> target     target     num_of_dirs      indexing
> sysdb:   556 MB       1790 dirs         49 MB
> usrdb:    6620 MB       8948 dirs        150 MB
> 
> compression ratio: 49/556 (sysdb)
> note: usrdb includes many non-text file.
> 
> -------------
> 
> The performance (retrieval time)
> system dependent
> 
> RQ search    # kfind foo
> 
> 0.1 seconds.
> 
> It is not important to make this time smaller.
> (sufficiently small)
> 
> RE search    # kfind -r foo
> 
> a few seconds
> 
> 
> -------------
> 
> The performance (construction/update)
> 
> (a) Construction time
> system dependent
> 
> Initial construction
> need
>    10 minutes for sysdb
>    30 minutes for usrdb
> 
> (b) Updating time
> two commands for update
> 
> mkdb
>    20 seconds to a few minutes for usrdb
>    depends largely on state of cache
> 
> mkdb1 (currently only for usrdb)
>    5 to 15 seconds for usrdb
>    mkdb1 needs event log
> 
> -------------
> 
> Scalability
> 
> Main factors
> 
> (a) retrieval time
> QE search: proportional to number of dirs that include the query
> RE search: proportional to size of index
> 
> (b) initial construction time
> proportional to total data
> 
> (c) update time
> mkdb:    proportional to number of dirs and the changes
> mkdb1:    proportional to changes and size of index
> 
> -------------
> 
> Used Tools
> 
> (1) rc
> 
> (2) grep, sed, awk, sort, diff, gzip, ...
> 
> (3) some new tools written in C
> 
> -------------
> 
> What Kirara means?
> 
> Kirara is name of a girl that appeared in a Japanese comic book.
> (But I have never read the book.)
> The name is seldom used in real world.
> From the name we Japanese imagine something glittering.
> I like the name.
> 
> -------------
> 
> References
> 
> [1] GLIMPSE: A Tool to Search Through Entire File Systems
> Udi Manber and Sun Wu (1993)
> http://webglimpse.net/pubs/glimpse.pdf
> [2] Glimpse Documentation
> http://webglimpse.net/gdocs/glimpsehelp.html
>

Re: [9fans] text database Kirara

Reply via email to