Re: [9fans] text database Kirara

brz-systemd-dev Tue, 06 Aug 2013 11:22:48 -0700

I've played around with Kirara for a couple hours, now, and am pretty
surprised at how simple it is.  It's already become integrated into my
workflow.  Being able to quickly (and easily) search for relevant
snippets of code throughout the system is quite useful.


I feel compelled to mention that the code is abnormally high in
quality.  (This is seen, even in the rc scripts)
Now I'm going to have to look through your other projects.

Thanks for releasing this.

- BurnZeZ

Bug:
        kirara-1.1/INSTALL:9: mkdir -p $kirarar/bin/^(rc $objtype)
        Here (and on line 11), '$kirarar' is used instead of '$kirara'.

--- Begin Message ---

Hello 9fans,

I have written a text database named Kirara.
The following is a brief introduction to Kirara.
If you are interested in, get Kirara from:
  http://plan9.aichi-u.ac.jp/netlib/kirara/

Kenji Arisawa

-------------

Kirara

-------------

Kirara is a text indexing/retrieval tool for Plan 9.

Personal use: index/retrieve local files.

Kirara is based on the idea similar to Glimpse.

(1) indexing + grep
(2) multi-level indexing

(a) small space for indexing
(b) small update time
(c) quick search

Note that:
small indexing   <->    quick search
Kirara makes more index -> quick search
Glimpse is single-level indexing.

-------------

Query

Kirara does not support phrase search.
The database is index of words,

supporting:
QE mode (query expression mode)
        '&', '|', '*'
The example:
        'snoopy&html'
        'snoop*&htm*'

RE mode (regular expression mode)
        '&', RE
where RE denotes regular expression.
The example:
        'sn.*y&h.+l'

RE mode is a bit slow. (a few second.)

-------------

Words

Two or more runes.
All words are converted to lower case.
In English, words is composed of alphabets.
The number of runes is configurable

Assumption:
Text is composed of space-separated words
popular in English and many European Languages,
but not in Japanese.

-------------

The user's interface

Best match with Rio
        term% kfind snoop
        G snoop /sys/src/9/ip/
        G snoop /sys/src/cmd/spell/
        G snoop /sys/src/9/kw/
        ...
        term% G snoop /sys/src/9/ip
        devip.c:34:     Qsnoop,
        devip.c:95:     case Qsnoop:
        devip.c:98:             devdir(c, q, "snoop", qlen(cv->sq), cv->owner, 
0400, dp);
        ...
Note that: two steps
1. find directories
2. find files and the contents
Step 2 is actually 'grep'. we can use RE.

Two-steps search is not a weekness, but a desirable feature.
Because we have so many files that are hit by the query.

-------------

The organization

My example

/n/other/kirara/sysdb
target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)

/n/other/kirara/usrdb
target: $home/^(bin/rc lib netlib doc adm issues srclib src sources)

Indexing target is fully configurable.

-------------

Multi-Level Indexing

(1) Indexing (top level)
word to directory mapping

sysdb/index             # main index                                    # used 
for RE mode
sysdb/mindex    # meta index (alphabetic index) # used for QE mode
sysdb/dind/*    # rough index of each directory
sysdb/QTDir             # map table (QID, mtime, path-to-dir)

index           # word to dir QID
        aa      0000000000014f0a
        aa      000000000001a1e0
        aa      000000000001a26e

mindex  # word to range in index
        aa 0 126669
        ab 126669 491569
        ac 491569 1258566
        ad 1258566 1852467
        ...
dind/*                          # `*' is a directory QID
        0000000000014f05
        0000000000014f0a
        000000000001a1ce

usrdb is same.

(2) Indexing (directory level)  # optional
word to file mapping

sysdb/find/*/ind.gz     # fine index of the directory (gzipped)
sysdb/find/*/qtn        # map table (QID, mtime, name)
where `*' is a directory QID

usrdb is same as sysdb.

-------------

Experiment

(a) hardware
GA-H61M-USB3-B3
Intel Pentium G860 (3GHz)
DDR3 PC3 4GB

(b) software
9front
cwfs64x

-------------

The performance (compression ratio)

target   target  num_of_dirs      indexing
sysdb:   556 MB    1790 dirs             49 MB
usrdb:  6620 MB    8948 dirs            150 MB

compression ratio: 49/556 (sysdb)
note: usrdb includes many non-text file.

-------------

The performance (retrieval time)
system dependent

RQ search       # kfind foo

0.1 seconds.

It is not important to make this time smaller.
(sufficiently small)

RE search       # kfind -r foo

a few seconds


-------------

The performance (construction/update)

(a) Construction time
system dependent

Initial construction
need
        10 minutes for sysdb
        30 minutes for usrdb

(b) Updating time
two commands for update

mkdb
        20 seconds to a few minutes for usrdb
        depends largely on state of cache

mkdb1 (currently only for usrdb)
        5 to 15 seconds for usrdb
        mkdb1 needs event log

-------------

Scalability

Main factors

(a) retrieval time
QE search: proportional to number of dirs that include the query
RE search: proportional to size of index

(b) initial construction time
proportional to total data

(c) update time
mkdb:   proportional to number of dirs and the changes
mkdb1:  proportional to changes and size of index

-------------

Used Tools

(1) rc

(2) grep, sed, awk, sort, diff, gzip, ...

(3) some new tools written in C

-------------

What Kirara means?

Kirara is name of a girl that appeared in a Japanese comic book.
(But I have never read the book.)
The name is seldom used in real world.
From the name we Japanese imagine something glittering.
I like the name.

-------------

References

[1] GLIMPSE: A Tool to Search Through Entire File Systems
Udi Manber and Sun Wu (1993)
http://webglimpse.net/pubs/glimpse.pdf
[2] Glimpse Documentation
http://webglimpse.net/gdocs/glimpsehelp.html

--- End Message ---

Re: [9fans] text database Kirara

Reply via email to