Thinking about misspellings and phrases that commonly appear in spam, I'd like to be able to run a few simple programs over a collection of spam and ham to pull out the words that appear in them, find the obvious misspellings, and identify the sequences of words that commonly appear in spam and differentiate it from ham.
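For the phrase-hunting part, here is a minimal sketch of the kind of program I have in mind. It is written in Python rather than Perl and assumes the messages have already been reduced to clear text. The names `ngrams` and `spam_only_phrases` and the `min_count` threshold are made up for illustration; this is plain n-gram counting, not anything SA or Bayes does internally.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Yield lowercase word n-grams from a piece of plain text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def spam_only_phrases(spam_texts, ham_texts, n=2, min_count=2):
    """Return (phrase, count) pairs seen at least min_count times in spam
    and never in ham, most frequent first."""
    spam_counts = Counter(g for t in spam_texts for g in ngrams(t, n))
    ham_seen = {g for t in ham_texts for g in ngrams(t, n)}
    return [(g, c) for g, c in spam_counts.most_common()
            if c >= min_count and g not in ham_seen]
```

Calling `spam_only_phrases(spam_bodies, ham_bodies, n=1)` gives single words instead of phrases, which is a rough way to surface misspellings that never occur in ham.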
I realize this is the kind of thing that Bayes does, and certainly SA's rules have been built to tag commonly occurring spam phrases, but it would sometimes be useful to have the clear-text form of the messages available, to make it easy to test various rules against them.

Is there some sort of low-level way to access SA's message processing routines, to produce messages where (1) all text (or HTML) parts have been decoded into text, and (2) HTML has in turn been converted into clear, linear text?

Basically, what I'd like to do is run formail over an mbox of messages and produce an equivalent mbox with the encoded and HTML parts converted to clear text:

    formail -s convert_to_clear_text < original.mbox > clear_text.mbox

where 'convert_to_clear_text' is presumably a Perl program that invokes the necessary parts of SA to perform the decoding and conversion.
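To make the request concrete, here is a rough sketch of what such a `convert_to_clear_text` filter could look like. It does not use SA's routines at all; it is a stand-in written with Python's standard `email` and `html.parser` modules, so its HTML rendering will not match whatever SA does internally. The helper names, the list of headers kept, and the placeholder "From " line are my own assumptions.

```python
#!/usr/bin/env python3
import re
import sys
from email import message_from_bytes, policy
from html.parser import HTMLParser

KEEP = ("From", "To", "Subject", "Date", "Message-ID")  # arbitrary choice


class _TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p", "div", "tr", "li"):
            self._chunks.append("\n")  # treat block-level tags as line breaks

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self._chunks.append(data)

    def text(self):
        lines = (" ".join(l.split()) for l in "".join(self._chunks).splitlines())
        return "\n".join(l for l in lines if l)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def _decoded(part):
    """Decode a text part; fall back to Latin-1 for bogus charsets."""
    try:
        return part.get_content()
    except (LookupError, UnicodeError):
        return (part.get_payload(decode=True) or b"").decode("latin-1", "replace")


def clear_text(raw_bytes):
    """Return one mbox entry with decoded headers and a plain-text body."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    texts = []
    for part in msg.walk():
        if part.is_multipart() or part.get_content_disposition() == "attachment":
            continue
        if part.get_content_type() == "text/plain":
            texts.append(_decoded(part).strip())
        elif part.get_content_type() == "text/html":
            texts.append(html_to_text(_decoded(part)))
    # Placeholder envelope line if the input had none (assumption).
    out = [msg.get_unixfrom() or "From unknown Thu Jan  1 00:00:00 1970"]
    out += [f"{name}: {msg[name]}" for name in KEEP if msg[name] is not None]
    out.append("Content-Type: text/plain; charset=utf-8")
    body = re.sub(r"(?m)^From ", ">From ", "\n".join(texts))  # mbox quoting
    return "\n".join(out) + "\n\n" + body + "\n"


if __name__ == "__main__" and not sys.stdin.isatty():
    raw = sys.stdin.buffer.read()  # formail -s pipes in one message per run
    if raw.strip():
        sys.stdout.buffer.write(clear_text(raw).encode("utf-8"))
```

Saved as an executable `convert_to_clear_text`, it would slot into the formail command as written. Both alternatives of a multipart/alternative message end up in the output, so the text appears twice for such mail; that may or may not be what you want for rule testing.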