2000-02-26-07:34:10 Thomas Roessler:
> I seem to recall that someone on this list wrote about some clever
> scripts to automate the use of glimpse and mutt to search huge
> e-mail archives. Any pointers?
I did that a while back. I can describe 'em in some detail. However,
I don't have them any more. At one point an OS upgrade left me
with glimpseindex bombing off fatally, leaving crud behind that
made glimpse bomb, and when I went to check for newer versions I
discovered that it had gone away; it's now a PC-style shareware
thingie, registration required and all that. Sad to see useful code
go down.
I'm still looking for a good replacement.
But here's the strategy. The design follows from three facts: (a)
using glimpse to find the files containing a search pattern is
_really_ fast, (b) rebuilding the index is _really_ slow, and (c)
the agrep tool carries out the precise same search, with the
precise same criteria, over a collection of un-indexed files.
So what I did was (of course) use Maildir, so each message was
in a separate file, exposing the mailbox folder structure to
file-oriented tools like glimpse. Incoming personal email is
delivered into ~/Maildir and mailing list traffic is automatically
filed (via procmail) into folders under ~/Mail. When I hand-file
personal correspondence it goes into folders under ~/Mail as
well. All normal. Periodically (daily, in my case) a script
moves everything from ~/Mail/*/cur to ~/archive/Mail/*/cur (i.e.
it leaves new messages in place), then runs glimpseindex to rebuild
the index of ~/archive/Mail.
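For illustration, here is a minimal Python sketch of that archiving
step. It is not the original script (which is lost). The function
names are made up for this example, and the glimpseindex invocation
assumes that "-H DIR" names the directory where the index files are
kept, so check that against your local man page.

```python
import shutil
import subprocess
from pathlib import Path


def archive_read_mail(mail_root, archive_root):
    """Move every message in <mail_root>/*/cur to <archive_root>/*/cur.

    Messages still sitting in new/ are left in place.
    Returns the number of messages moved.
    """
    moved = 0
    for cur in sorted(Path(mail_root).glob("*/cur")):
        dest = Path(archive_root) / cur.parent.name / "cur"
        # Keep each archive folder a complete Maildir (cur/, new/, tmp/).
        for sub in ("cur", "new", "tmp"):
            (dest.parent / sub).mkdir(parents=True, exist_ok=True)
        for msg in sorted(cur.iterdir()):
            if msg.is_file():
                shutil.move(str(msg), str(dest / msg.name))
                moved += 1
    return moved


def rebuild_index(archive_root, index_cmd=("glimpseindex", "-H")):
    """Re-index the archive (the slow part).

    ASSUMPTION: `glimpseindex -H DIR DIR` stores the index in DIR and
    indexes everything below DIR.
    """
    root = str(archive_root)
    subprocess.run([*index_cmd, root, root], check=True)
```

Run daily from cron, this would amount to calling
archive_read_mail(~/Mail, ~/archive/Mail) followed by
rebuild_index(~/archive/Mail), with the paths expanded.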
The search command runs the pattern over ~/archive/Mail with
glimpse, and over ~/Mail with agrep, in both cases using the "-l"
option to produce the names of files containing matches; it takes
all the matching files and symlinks 'em into a tmp Maildir
(~/tmp.$$/cur/, with empty new/ and tmp/ dirs made alongside it),
invokes mutt on that tmp folder, and finally deletes it.
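The search command described above could be sketched in Python
roughly as follows. Again, this is a reconstruction under
assumptions, not the lost original: the function names are
hypothetical, the scratch folder comes from tempfile.mkdtemp rather
than ~/tmp.$$, and "glimpse -H DIR" is assumed to point glimpse at
the index kept in DIR. The "-l" option on both tools and "mutt -f"
are used as described in the text.

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path


def matching_files(cmd):
    """Run a search command that prints one matching file name per line."""
    done = subprocess.run(cmd, capture_output=True, text=True)
    return [line for line in done.stdout.splitlines() if line]


def build_result_maildir(paths, parent=None):
    """Create a scratch Maildir whose cur/ holds symlinks to `paths`."""
    root = Path(tempfile.mkdtemp(prefix="search.", dir=parent))
    for sub in ("cur", "new", "tmp"):
        (root / sub).mkdir()
    for n, path in enumerate(paths):
        src = Path(path).resolve()
        # Prefix a counter so equal names from different folders cannot clash.
        os.symlink(src, root / "cur" / f"{n:06d}.{src.name}")
    return root


def search_mail(pattern, mail_root, archive_root):
    """Search the archive with glimpse, live folders with agrep, then run mutt."""
    # ASSUMPTION: -H names the directory holding the glimpse index.
    hits = matching_files(["glimpse", "-H", str(archive_root), "-l", pattern])
    live = [str(p) for p in Path(mail_root).glob("*/*/*") if p.is_file()]
    if live:
        hits += matching_files(["agrep", "-l", pattern, *live])
    folder = build_result_maildir(hits)
    try:
        subprocess.run(["mutt", "-f", str(folder)])
    finally:
        shutil.rmtree(folder)  # removes only the symlinks, not the messages
```

Because the scratch folder holds only symlinks, deleting it
afterwards never touches the real messages.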
This worked so well I also made a very similar system that kept an
archive of all the trouble tickets found in an in-house ticket
database, converted to look like RFC-822 messages organized in
Maildirs by date. I wrote a nearly-realtime mirroring program for
the database, and used the same strategy to provide searches for it.
The ticket system didn't have full-text indexed searches, and I
really wanted 'em.
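As a rough illustration of the ticket conversion, the sketch below
renders one ticket as an RFC-822 style message and files it in a
Maildir named after the month it was opened. Everything specific
here is an assumption made for the example: the ticket fields (id,
owner, summary, opened, body), the per-month folder naming, and the
tickets.invalid Message-ID domain. The real ticket database and its
mirroring program are not shown in this message.

```python
import mailbox
from email.message import EmailMessage
from email.utils import format_datetime
from pathlib import Path


def ticket_to_message(ticket):
    """Render one ticket (a dict with hypothetical keys) as a mail message."""
    msg = EmailMessage()
    msg["From"] = ticket["owner"]
    msg["Subject"] = f"[ticket {ticket['id']}] {ticket['summary']}"
    msg["Date"] = format_datetime(ticket["opened"])
    # Hypothetical Message-ID scheme; .invalid is a reserved placeholder TLD.
    msg["Message-ID"] = f"<ticket-{ticket['id']}@tickets.invalid>"
    msg.set_content(ticket["body"])
    return msg


def store_ticket(ticket, root):
    """File the ticket under <root>/<YYYY-MM>/ and return its Maildir key."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    folder = root / ticket["opened"].strftime("%Y-%m")
    # create=True makes the folder plus its cur/, new/ and tmp/ if missing.
    box = mailbox.Maildir(str(folder), create=True)
    return box.add(ticket_to_message(ticket))
```

Once the tickets sit in Maildirs as one file per ticket, the same
archive, index and search steps apply to them unchanged.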
Boy I wish I had a tool pair like glimpse and agrep.
-Bennett