Thinking over the topic of misspellings and phrases that commonly appear
in spam, I'd like to be able to run a few simple programs over a
collection of spam and ham to pull out the words that appear in each, to
find the obvious misspellings, and to find the sequences of words that
commonly appear in spam and differentiate it from ham.
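
To make "simple programs" concrete, here is a rough sketch of the kind
of phrase counter I have in mind. It is plain Python, not anything taken
from SA, and the names (tokens, ngram_counts, spammy_phrases) and the
tokenizing regex are just my own placeholders. It assumes the messages
are already available as clear-text strings.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters, digits, ' and $.
WORD_RE = re.compile(r"[a-z0-9'$]+")


def tokens(text):
    """Split a clear-text message into lowercase word tokens."""
    return WORD_RE.findall(text.lower())


def ngram_counts(texts, n=2):
    """Count, for each n-word phrase, how many messages contain it."""
    counts = Counter()
    for text in texts:
        words = tokens(text)
        # A set, so a phrase repeated inside one message counts once.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return counts


def spammy_phrases(spam_texts, ham_texts, n=2, min_spam=2):
    """Phrases seen in at least min_spam spam messages and in no ham."""
    spam = ngram_counts(spam_texts, n)
    ham = ngram_counts(ham_texts, n)
    hits = [(p, c) for p, c in spam.items() if c >= min_spam and ham[p] == 0]
    # Most frequent first, ties broken alphabetically.
    return sorted(hits, key=lambda pc: (-pc[1], pc[0]))
```

With n=1 the same function lists single words that occur only in spam,
which is where the obvious misspellings should show up.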

I realize this is the kind of thing that Bayes does, and certainly SA's
rules have been built to tag commonly occurring spam phrases, but it
would sometimes be useful to have the clear-text form of the messages
available, to make it easy to test various rules against them.

Is there some sort of low-level way to access SA's message processing
routines, to produce messages where (1) all text (or html) parts have
been decoded into text, and (2) html has in turn been converted into
clear, linear text?

Basically what I'd like to do is run formail over an mbox of messages,
and produce an equivalent mbox with the encoded and html parts converted
to clear text:

   formail -s convert_to_clear_text < original.mbox > clear_text.mbox

where 'convert_to_clear_text' is presumably a Perl program that invokes
the parts of SA needed to perform the decoding and conversion.




_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
