On 2001-01-18, Mukul Sabharwal <[EMAIL PROTECTED]> wrote:
> How do sites like http://marc.theaimsgroup.com make mailing list indexes,
> like phpbuilder and Zend's site, which archive the list? How do they index
> all the messages?
Well, this is pretty OT, but since my alternative is to keep working on a
boring report I'm writing, I'll waste some bandwidth... ;)
We (marc.theaimsgroup.com, which is also lists.kde.org) use a mishmash of
really ugly perl that I (mostly) have written over the past ~4 years.
There were a few design priorities / realities that played a big part in
how we ended up doing things:
-We hated the "huge index page" problem that Hypermail often gets into
-We hated the "searches pick up the text from forward/back links" problem
that ht://Dig and glimpse often got into
-We hated the "generate different links to the same document so your
browser history is completely confused" problem that DejaNews got into
-We hated banner ads and other useless fluff that makes accessing content
painful and slow
-We hated sites that were not lynx-friendly
-I was the primary author, and I knew little about perl and less about
databases when we started (and have been stuck with my mistakes since...)
-I am lazy and busy, a bad combination
First, a database-insertion script takes messages in from sendmail and
stuffs them into a database of message-headers and message-bodies. From
there the messages get associated with the appropriate list, thread, and
author and get full-text-, author-, and thread-indexed by some background
daemons. We use MySQL and a fairly uninspired database layout (which I've
been slowly evolving behind the scenes as I pick up a clue here and there
about DB design) to store everything.
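Very roughly, the storage side boils down to something like the sketch below.
The table and column names here are illustrative guesses at a minimal layout,
not a dump of our actual schema:

-- hypothetical minimal layout for the header/body split described above
CREATE TABLE msg_header (
  msgid    INT UNSIGNED NOT NULL PRIMARY KEY,  -- unique per-message id
  listid   INT UNSIGNED NOT NULL,              -- which list it arrived on
  thrdid   INT UNSIGNED NOT NULL,              -- thread association
  authid   INT UNSIGNED NOT NULL,              -- author association
  subject  VARCHAR(255),
  sentdate DATETIME
);

CREATE TABLE msg_body (
  msgid    INT UNSIGNED NOT NULL PRIMARY KEY,  -- 1:1 with msg_header
  body     MEDIUMTEXT
);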
The full-text index architecture is probably the best part (read: I had the
most help from "real database programmer" coworkers when it was
architected). RDBMS's really aren't the "right" tool for such a job, but
we wanted to do it that way anyway, and haven't been disappointed. For
each message/thread/author, a unique list of words is built. A
"dictionary" table maps from word:wordid where wordid is an int. Each
message/thread/author has a unique int id as well. wordid:msgid (or thrdid
or authid) pairs are stored in index tables (with a composite primary key
on wordid:msgid, thus ensuring unique pairs and also making a "select *
where wordid=49" quite fast). This means we throw away all positional
information (which is why things like phrase-searching aren't supported
yet).
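If it helps to picture it, the index side amounts to roughly this (again,
illustrative names; wrdmsg is the per-list word:message table used in the
sample query below):

-- hypothetical sketch of the dictionary / index layout described above
CREATE TABLE dictionary (
  wordid INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  word   VARCHAR(64) NOT NULL,
  UNIQUE (word)                   -- one row per distinct word
);

-- one of these per list; one row per (word, message) pair
CREATE TABLE wrdmsg (
  wordid INT UNSIGNED NOT NULL,
  msgid  INT UNSIGNED NOT NULL,
  PRIMARY KEY (wordid, msgid)     -- unique pairs; "WHERE wordid=49" stays fast
);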
The web interface is a big honking perl script which wishes it could grow
up to be clean code running under mod_perl (so it could be as fast as all
those PHP sites :), but I'm too lame^H^H^H^Hbusy to ever get around to
doing the rewrite. When a full-text search is done, the wordid's of all
input words are grabbed from the dictionary. Then the appropriate lookup
table (we keep a separate table per list, a design tradeoff) is joined with
itself with something like (a sample 3-word query here):
SELECT w1.msgid FROM wrdmsg AS w1, wrdmsg AS w2, wrdmsg AS w3
WHERE w1.wordid=search_term_wordid_1
AND w2.wordid=search_term_wordid_2
AND w3.wordid=search_term_wordid_3
AND w1.msgid=w2.msgid
AND w1.msgid=w3.msgid
...With appropriate LIMITs, etc. These msgids are then used to pull up the
appropriate message header info to print the results of a search. The
reason for the word -> wordid mapping is that straight numeric comparisons
are much more efficient for the DB engine to make (and it's also much
smaller on-disk to repeat a wordid once per message than the whole word).
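Spelled out, the lookup steps around that join are roughly as follows
(illustrative values; the header table and columns follow the hypothetical
sketch above, not our real schema):

-- step 1: map the search words to wordids
SELECT word, wordid FROM dictionary WHERE word IN ('mysql', 'index', 'perl');

-- step 2 is the self-join shown above; step 3 pulls header info for display
SELECT msgid, subject, authid, sentdate
  FROM msg_header
 WHERE msgid IN (49, 212, 3087)
 LIMIT 20;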
When we designed this years ago we tested a few different algorithms, ways
of storing the data, etc. We did quick mock-up tests simulating up to
about 100,000 messages, and were pretty sold on this architecture. One
thing it is not is terribly disk-efficient, but it performs pretty well (at
4.3 million mails now and counting...). The site's hardware, connectivity,
and basic care & feeding are donated by the AIMS Group guys (the same
people, under a different name, that created the site with me years ago).
It's currently running on a dual P2-400--nice, but not amazing hardware.
There are lots of ways it could be better; sometimes I wish it was my day
job so I could Do It Right, other times I'm glad it's just a hobby ;)
--
Hank Leininger <[EMAIL PROTECTED]>