It seems to me that, for purposes of rule simplification, the
subject and body of messages to be scanned should be available in
pre-processed flavors, some of which are currently available. Assume the spam
key is something like that Vuhee drug, V=P i=o e=a n=g s=r u=a (i.e.
Poensu).

RAW     untouched

RARE    (de-MIMEd) eye-readable 8/16 charset with HTML intact

FOLDED  set all lowercase
        remove HTML
        punctuation replaced by underscore,
        repeated punctuation collapsed to 1 instance

        P.o;[EMAIL PROTECTED] becomes p_o_3_n_s_u

PLAIN   all lowercase, remove all punctuation
        P.o;[EMAIL PROTECTED] becomes po3nsu

ALPHED  strip numerics as well
        P.o;[EMAIL PROTECTED] becomes ponsu
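
To make the three reduced flavors concrete, here is a rough Python sketch of how I picture them. The function names, the crude tag-stripping regex, and the sample input are all made up for illustration (the sample is a stand-in, not the obfuscated string quoted above), and I am assuming that PLAIN and ALPHED also drop HTML and that any run of mixed punctuation counts as "repeated". Anything outside a-z and 0-9, including non-ASCII letters, is treated as punctuation here.

```python
import re

# Crude tag stripper, for illustration only; a real filter would use an HTML parser.
_TAG = re.compile(r"<[^>]*>")
# Any run of characters that are not a-z, 0-9, or whitespace (applied after lowercasing).
_PUNCT_RUN = re.compile(r"[^a-z0-9\s]+")


def folded(text):
    """FOLDED: lowercase, HTML removed, each run of punctuation becomes one underscore."""
    return _PUNCT_RUN.sub("_", _TAG.sub("", text).lower())


def plain(text):
    """PLAIN: lowercase, HTML removed (assumed), all punctuation removed."""
    return _PUNCT_RUN.sub("", _TAG.sub("", text).lower())


def alphed(text):
    """ALPHED: PLAIN with the digits stripped as well."""
    return re.sub(r"[0-9]+", "", plain(text))


if __name__ == "__main__":
    sample = "P.o;3-n$s.u"  # hypothetical obfuscated spelling
    print(folded(sample))
    print(plain(sample))
    print(alphed(sample))
```

Each flavor is derived from the previous one, so a scanner could compute them once per message and cache the results.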

Rules would be defined along the same lines as is currently done for
Subject and body, e.g. Subject-PLAIN, Body-ALPHED, etc. Bayes should tokenize
the most reduced (ALPHED) stream.
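
As a sketch of what I mean by rule targets, the following Python fragment keys each pre-processed view by "<Part>-<FLAVOR>" and lets a rule name the view it wants. This is not SpamAssassin's rule syntax; the rule names, the views dictionary, and both helper functions are hypothetical, and the view contents are typed in by hand rather than computed.

```python
import re

# Hypothetical pre-processed views of one message, keyed "<Part>-<FLAVOR>".
views = {
    "Subject-RAW": "Cheap P.o;3-n$s.u here",
    "Subject-PLAIN": "cheap po3nsu here",
    "Subject-ALPHED": "cheap ponsu here",
}

# Each rule is (name, target view, pattern); the names are invented for this example.
rules = [
    ("DRUG_PLAIN", "Subject-PLAIN", re.compile(r"po3nsu")),
    ("DRUG_ALPHED", "Subject-ALPHED", re.compile(r"ponsu")),
]


def matching_rules(views, rules):
    """Return the names of rules whose pattern is found in their target view."""
    return [name for name, target, pattern in rules
            if pattern.search(views.get(target, ""))]


def bayes_tokens(views, part="Body"):
    """Split the most reduced (ALPHED) view of a message part on whitespace."""
    return views.get(f"{part}-ALPHED", "").split()
```

The point of the design is that a single simple pattern against the ALPHED view replaces a pile of regex alternatives written against the raw text.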


Best Regards
Bob
Robert Strickler


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk