As most of you are aware, I have for the last two years been writing a mailing list archiver called 'lurker' (.sf.net). Recently, all the details came together in a nicely unified whole which seems quite stable. As a debian nut, I am highly interested in seeing lurker used on the debian lists and have set most of my design requirements to meet this goal.
To summarize why lurker is good for debian:

 * it scales to the volume of debian email with full-text search
 * it supports multiple character sets in headers + body
 * the threading is much more useful (imo)
 * cross-posting is understood and works with threading
 * attachments such as patches and signatures are treated correctly
 * the debian archives were its testcase data

The biggest problem deploying it is that debian has a lot of mail. The archive is so large that the lurker database will probably exceed 2GB, which means that the system running lurker must have LFS support.

The other issue is that lurker must have all of the mail for a single mailing list in one mailbox. This may cause problems since debian people like to use mutt on the archived mailboxes and such a mbox would be far too large.

Let me briefly outline why lurker has one mailbox per list:

 1. it makes automatic database upgrades possible after format changes (lurker can simply regenerate its database using the mailboxes in its database dir without having to flounder about asking for help)
 2. it keeps the option of opening all the mailboxes at once available (prior versions of lurker did not have enough file descriptors for some really large mailing lists)
 3. it makes lurker-index have the fire-and-forget property: simply piping a message into lurker-index is sufficient to have it looked after. You don't need to worry about keeping the source mailbox in the same location or touching the source mailbox with an editor, or even keeping it
 4. it keeps users from poking the mailbox when it is "part of the database"
 5. it is much simpler for the lurker code -> more robust

Of these I think 3 is the most important. Versions of lurker prior to v0.5 required the administrator to list each mailbox which comprised a mailing list in lurker's config file. If there were new monthly mailboxes, the config file needed to be updated. These mailboxes could then not be moved or modified in any way other than append.
Invariably, something went wrong. Furthermore, because new messages were not fed via a push script like lurker-index, a daemon was needed to monitor all the mailboxes. This, of course, required they all be opened!

The current solution, otoh, is far simpler. You simply take a single message or mailbox and pipe it through lurker-index, saying which mailing list it is for. Then you never have to worry about it again.

For these reasons I consider the new scheme to be superior from a usability and robustness standpoint. I know that it takes more disk space, which is why I only switched to it after a lot of deliberation. I just had to really convince myself that "disk is cheap". Besides, for normal users of lurker, the mailbox does not need to be mutt accessible, so there is no need to keep another copy of the mailbox. And if it were mutt accessible, you would have to be absolutely certain mutt didn't change the Status: flag!

I have ideas for deploying lurker, but will keep my mouth shut unless asked, as I don't want to step on any administrator's toes. I will mention, however, that the current debian interface can be preserved with lurker. Specifically:

The pages like http://lists.debian.org/users.html can be built as static html with a per-month perl cron job. This is because lurker message index urls are keyed by date, so one can readily hard-code a url which jumps to the current time.

The page http://lists.debian.org/search.html can still be static html. Now it just submits to keyword.cgi. However, lurker searches operate differently than glimpse, since lurker uses a reverse-index rather than grep. This means that partial matches, misspellings, and regexps cannot be supported. Otoh, the max messages returned and date fields are mostly irrelevant, since lurker returns results centered around a specified date. The search may then be refined at that position in time, or you can move through time: backwards, forwards, or by jumping.
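
To make the reverse-index point concrete, here is a minimal Python sketch. It is not lurker's actual code or on-disk format; the class TinyIndex, its methods, and the integer timestamps are all invented for illustration. It only shows why a whole-word lookup cannot do partial matches the way grep can, and how results can be returned as a window centered on a date.

```python
from bisect import bisect_left
from collections import defaultdict


class TinyIndex:
    """Toy reverse index: word -> set of message ids (hypothetical, not lurker's format)."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> {msg_id, ...}
        self.dates = {}                   # msg_id -> integer timestamp

    def add(self, msg_id, timestamp, text):
        """Index every whitespace-separated word of one message."""
        self.dates[msg_id] = timestamp
        for word in text.lower().split():
            self.postings[word].add(msg_id)

    def search(self, words, around, limit=3):
        """Return up to `limit` ids that contain all words, centered in time on `around`."""
        sets = [self.postings.get(w.lower(), set()) for w in words]
        if not sets:
            return []
        # Only exact whole-word keys exist, so "patc" cannot find "patch".
        hits = sorted(set.intersection(*sets), key=lambda m: (self.dates[m], m))
        times = [self.dates[m] for m in hits]
        pos = bisect_left(times, around)          # first hit at or after `around`
        lo = max(0, pos - limit // 2)             # start the window a little earlier
        lo = min(lo, max(0, len(hits) - limit))   # keep the window full near the end
        return hits[lo:lo + limit]


idx = TinyIndex()
idx.add("a", 100, "lurker patch for threading")
idx.add("b", 200, "patch review")
idx.add("c", 300, "another patch attached")
idx.add("d", 400, "Patch applied")
idx.add("e", 500, "final patch")

print(idx.search(["patch"], around=300))               # window around t=300
print(idx.search(["patc"], around=300))                # partial word: no hits
print(idx.search(["patch", "threading"], around=300))  # both words required
```

Moving "through time" in this sketch is just a matter of calling search again with a different `around` value.
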
Finally, entries for specific search terms can be added: author, subject, thread, reply-to, message-id.

For lurker-generated content, colour changes and so forth can be done with the style-sheet. More structural changes can be done by tweaking the xslt used to render html. I will make any specific UI changes required to adapt lurker's appearance to match the debian site, although this could be done by the webmaster if they are familiar with xslt.

Thanks for your time!

--
Wesley W. Terpstra <[EMAIL PROTECTED]>