> It would seem to me that, for purposes of rule simplification, the subject
> and body of messages to be scanned should be available in pre-processed
> flavors, some of which are currently available. Assume the spam key is
> something like that Vuhee drug, V=P i=o e=a n=g s=r u=a (i.e. Poensu)
>
> RAW untouched
>
> RARE (de-mimed) eye-readable 8/16 charset with HTML intact
> FOLDED set all lowercase
> Remove HTML
> punctuation to be underscore,
> repeated punctuation collapsed to 1 instance
>
> P.o;[EMAIL PROTECTED] becomes p_o_3_n_s_u
>
> PLAIN all lowercase remove all punctuation
> P.o;[EMAIL PROTECTED] becomes po3nsu
>
> ALPHED strip numerics as well
> P.o;[EMAIL PROTECTED] becomes ponsu
>
> Rules would be defined along the same lines as currently done for Subject
> and body, e.g. Subject-PLAIN, Body-ALPHED etc. Bayes should tokenize the
> most reduced (ALPHED) stream.

I wonder if there would be any benefit in adding an inverse ALPHED stream?
Would Bayesian classification of a stream in which all alphabetic characters
were stripped, leaving just the special characters and numbers, improve the
spam detection rate?

Best regards,
Bob

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
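
For anyone who wants to experiment with the FOLDED, PLAIN, ALPHED and inverse-ALPHED streams discussed above, here is a rough Python sketch. It is not SpamAssassin code, and the function names are made up for illustration. It assumes ASCII text and uses a crude regex to drop HTML tags rather than a real parser. The sample string is invented too, since the one in the quoted mail was mangled by the list archive.

```python
import re

# Crude tag stripper (assumption: good enough for a demo, not a real HTML parser).
TAG_RE = re.compile(r"<[^>]+>")
# A run of one or more characters that are not lowercase ASCII letters,
# digits, or whitespace. Applied after lowercasing.
PUNCT_RUN_RE = re.compile(r"[^a-z0-9\s]+")


def folded(text):
    """Lowercase, drop HTML, turn each run of punctuation into one underscore."""
    text = TAG_RE.sub("", text).lower()
    return PUNCT_RUN_RE.sub("_", text)


def plain(text):
    """Lowercase, drop HTML, remove all punctuation."""
    text = TAG_RE.sub("", text).lower()
    return PUNCT_RUN_RE.sub("", text)


def alphed(text):
    """PLAIN with the digits stripped as well."""
    return re.sub(r"[0-9]+", "", plain(text))


def inverse_alphed(text):
    """Drop HTML and all ASCII letters, keeping digits and special characters."""
    return re.sub(r"[A-Za-z]+", "", TAG_RE.sub("", text))


sample = "P.o;;3--n!s..u"  # invented sample, not the string from the quoted mail
print(folded(sample))          # p_o_3_n_s_u
print(plain(sample))           # po3nsu
print(alphed(sample))          # ponsu
print(inverse_alphed(sample))  # .;;3--!..
```

Whether the inverse stream would actually help Bayes is the open question. The sketch only shows what such a stream would look like for one obfuscated token.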