On 2001-01-18, =?iso-8859-1?q?Mukul=20Sabharwal?=
<[EMAIL PROTECTED]> wrote:

> how do sites like http://marc.theaimsgroup.com, make
> mailing list indexes, like even phpbuilder and zend's
> site which archive the list.

> How do they index all the message ?

Well, this is pretty OT, but since my alternative is to keep working on a
boring report I'm writing, I'll waste some bandwidth... ;)

We (marc.theaimsgroup.com, which is also lists.kde.org) use a mishmash of
really ugly perl that I (mostly) have written over the past ~4 years. 
There were a few design priorities / realities that played a big part in
how we ended up doing things:

-We hated the "huge index page" problem that Hypermail often gets into
-We hated the "searches pick up the text from forward/back links" problem
  that ht://Dig and glimpse often got into
-We hated the "generate different links to the same document so your
  browser history is completely confused" problem that DejaNews got into
-We hated banner ads and other useless fluff that makes accessing content
  painful and slow
-We hated sites that were not lynx-friendly
-I was the primary author, and I knew little about perl and less about
  databases when we started (and have been stuck with my mistakes since...)
-I am lazy and busy, a bad combination

First, a database-insertion script takes messages in from sendmail and
stuffs them into a database of message-headers and message-bodies.  From
there the messages get associated with the appropriate list, thread, and
author and get full-text, author-, and thread-indexed by some background
daemons.  We use MySQL and a fairly uninspired database layout (which I've
been slowly evolving behind the scenes as I pick up a clue here and there
about DB design) to store everything.

The full-text index architecture is probably the best part (read: I had the
most help from "real database programmer" coworkers when it was
architected).  RDBMS's really aren't the "right" tool for such a job, but
we wanted to do it that way anyway, and haven't been disappointed.  For
each message/thread/author, a unique list of words is built.  A
"dictionary" table maps from word:wordid where wordid is an int.  Each
message/thread/author has a unique int id as well.  wordid:msgid (or thrdid
or authid) pairs are stored in index tables (with a composite primary key
on wordid:msgid, thus insuring unique pairs and also making a "select *
where wordid=49" quite fast).  This means we throw away all positional
information (which is why things like phrase-searching aren't supported
yet).

The web interface is a big honking perl script which wishes it could grow
up to be clean code running under mod_perl (so it could be as fast as all
those PHP sites :), but I'm too lame^H^H^H^Hbusy to ever get around to
doing the rewrite.  When a full-text search is done, the wordid's of all
input words are grabbed from the dictionary.  Then the appropriate lookup
table (we keep a seperate table per list, a design tradeoff) is joined with
itself with something like (a sample 3-word query here):

SELECT w1.msgid FROM wrdmsg AS w1, wrdmsg AS w2, wrdmsg AS w3
        WHERE w1.wordid=search_term_wordid_1
          AND w2.wordid=search_term_wordid_2
          AND w3.wordid=search_term_wordid_3
          AND w1.msgid=w2.msgid
          AND w1.msgid=w3.msgid

...With appropriate LIMITs, etc.  These msgids are then used to pull up the
appropriate message header info to print the results of a search.  The
reason for the word -> wordid mapping is because straight numeric
comparisons are much more efficient for the DB engine to make (and also
much smaller on-disk to repeat wordid's once per message than words).

When we designed this years ago we tested a few different algorithms, ways
of storing the data, etc.  We did quick mock-up tests simulating up to
about 100,000 messages, and were pretty sold on this architecture.  One
thing it is not is terribly disk-efficient, but it performs pretty well (at
4.3 million mails now and counting...).  The site's hardware, connectivity,
and basic care & feeding are donated by the AIMS Group guys (the same
people, under a different name, that created the site with me years ago).
It's currently running on a dual P2-400--nice, but not amazing hardware.  

There are lots of ways it could be better; sometimes I wish it was my day
job so I could Do It Right, other times I'm glad it's just a hobby ;)

--
Hank Leininger <[EMAIL PROTECTED]>

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
To contact the list administrators, e-mail: [EMAIL PROTECTED]

Reply via email to