Re: [SAtalk] Parsing almost-certainly-spam and probably-spam files

Bob Apthorpe Wed, 27 Aug 2003 06:26:49 +0000

On 26 Aug 2003 14:06:56 -0400 K Old <[EMAIL PROTECTED]> wrote:

> Hello everyone,
> 
> I'm using the 2.55 version of SA and everything works great.  I'm trying
> to find a good way to parse the almost-certainly-spam and probably-spam
> files that are produced by SA.  Given that the majority of the mail that
> makes it in these files is spam, every now and then a valid message will
> be tagged as spam and I need to restore it.  So, I look through these
> files to verify that all of the emails are indeed spam.  Thing is, it's
> time consuming.


Just out of curiosity, how many messages are you scanning through
(order-of-magnitude)?

> I'm writing a perl script that will strip out the From, Subject and
> X-Spam-Status and all that is fine.  The kicker is that when
> SpamAssassin writes the messages (in mbox format) to the file, it writes
> two.  One containing the SpamAssassin flags, etc. and the other is the
> original message which is left untouched so that restoring it is easy.

I think what you're seeing is the initial header from the SA-tagged
message followed by the original headers attached in message/rfc822
format. Ideally, you'd be able to strip the SpamAssassin 'wrapper' and
extract the message/rfc822 attachment, which should be easy if you can
extract the full message from the mbox and pipe it through `spamassassin
-d`. The trick now is extracting the full, individual messages from the
mbox format.
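
For illustration only, here is a rough sketch of that unwrapping step. It
uses Python's standard `email` package rather than the Perl modules
discussed in this thread, and it assumes the tagged message really does
carry the original as a message/rfc822 part, as described above. The
function name `unwrap_original` is made up for the example; this is not
what `spamassassin -d` does internally.

```python
import email
from email.message import Message


def unwrap_original(wrapper: Message) -> Message:
    """Return the first message/rfc822 attachment found in `wrapper`.

    If the message has no such attachment (i.e. it was not wrapped),
    the message itself is returned unchanged.
    """
    for part in wrapper.walk():
        if part.get_content_type() == "message/rfc822":
            # For message/rfc822 parts the payload is a one-element list
            # holding the embedded message object.
            payload = part.get_payload()
            if isinstance(payload, list) and payload:
                return payload[0]
    return wrapper
```

Given one raw message as a string, something like
`unwrap_original(email.message_from_string(raw))` would hand back the
embedded original, or the message itself if it was never wrapped.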

> I've looked at a few modules on CPAN, but haven't parsed mbox files
> before, and would like suggestions.  From what I understand if I can
> just get every other message I'll get what I need.

Better to let perl do the heavy lifting rather than guessing which
pieces of the whole mbox file you need. Take a look at

Mail::MboxParser
Mail::Mbox::MessageParser
Mail::Box
Mail::Util

for info on parsing mbox files, and

Mail::SpamAssassin->remove_spamassassin_markup()
Mail::Internet

for manipulating individual messages.

> Any advice/suggestions?

I'd probably use one of the first four modules to extract a list of
messages from the mbox file, then convert each of those messages into
Mail::Internet objects to analyze the appropriate headers, and strip off
the original SA tagging of suspected false positives with
Mail::SpamAssassin. At that point, you can do whatever you want with the
suspected FPs.
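
To make the first half of that workflow concrete, here is a minimal
sketch of the "walk the mbox and look at the headers" step. Again it is
written against Python's standard `mailbox` module as a stand-in, not
against Mail::Util or Mail::Internet, and the function name `summarize`
is hypothetical. It only reads the three headers mentioned earlier in
the thread; it does not strip any SpamAssassin markup.

```python
import mailbox


def summarize(mbox_path):
    """Yield (index, from, subject, spam_status) for each message in an mbox."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, msg in enumerate(box):
            # Headers may be folded across lines; collapse the whitespace
            # so each one prints on a single line.
            status = " ".join(str(msg.get("X-Spam-Status", "")).split())
            yield (
                index,
                str(msg.get("From", "")),
                str(msg.get("Subject", "")),
                status,
            )
    finally:
        box.close()
```

Printing those tuples gives a one-line-per-message listing to skim for
false positives; anything that looks legitimate can then be pulled out
by its index and have the tagging removed as described above.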

hth,

-- Bob

