On Tue, 20 Jan 2004 16:37:27 -0500 (EST), Charles Gregory <[EMAIL PROTECTED]> writes:
> I'm starting to see mail with TEXT obfuscation, such as:
>
>    I heard you need viagrPa.
>
> Note the capital P thrown in to our favorite 'v' word.
>
> It is really beginning to look like we need a genuine spelling checker, or
> some sort of 'approximation' technology, if such exists. There is no
> 'pattern' I can think of to defeat this mis-spelling spam in any other
> way.

For obfuscations of abcfffg:

Basically, we transform abcfffg into:

   a?bcfffg
   ab?cfffg
   abc?fffg
   abcf?ffg
   abcff?fg
   abcfff?g
   abcfffg?

to deal with any one single missing letter, and then put:

   ([ /_-]*|.?)

between each letter. That can represent both a run of separator (non-word)
characters *and* a single arbitrary character. Giving us seven lines like:

   a?([ /_-]*|.?)b([ /_-]*|.?)c([ /_-]*|.?)f([ /_-]*|.?)f([ /_-]*|.?)f([ /_-]*|.?)g([ /_-]*|.?) { 1; }

This, or variants of it, won't catch every word obfuscation, but these
should be somewhat more robust against FPs and may make it a lot easier
on Bayes.

A second option is to not do this as a rule, but to do this sort of
obfuscation analysis only on new tokens. If a token has never been seen
before, but it appears close to what seems to be an obfuscated bad word,
we assign it a provisional spam probability when doing the Bayesian
analysis.

Scott
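For anyone who wants to play with the idea, here is a rough Python sketch of the pattern generation described above, plus the "new tokens only" variant. It is only an illustration under my own assumptions, not SpamAssassin code: the function names, the use of a non-capturing group, case-insensitive whole-token matching, placing the separator only between letters (no trailing one), and the 0.9 provisional value are all made up for the example.

```python
import re

# Separator from the post: a run of separator characters, or at most one
# arbitrary character. Written as a non-capturing group (an assumption).
SEP = r"(?:[ /_-]*|.?)"


def obfuscation_patterns(word):
    """Return one regex source per letter of `word`.

    In the i-th pattern, letter i is optional (one missing letter), and
    SEP is placed between every pair of adjacent letters.
    """
    patterns = []
    for i in range(len(word)):
        parts = [
            re.escape(ch) + ("?" if j == i else "")
            for j, ch in enumerate(word)
        ]
        patterns.append(SEP.join(parts))
    return patterns


def looks_obfuscated(token, word):
    """True if the whole token matches any of the generated patterns."""
    combined = "|".join("(?:%s)" % p for p in obfuscation_patterns(word))
    return re.fullmatch(combined, token, re.IGNORECASE) is not None


def provisional_spam_probability(token, seen_tokens, bad_words, provisional=0.9):
    """Second option: only look at tokens never seen before.

    Returns the (hypothetical) provisional probability if the unseen token
    looks like an obfuscated bad word, otherwise None.
    """
    if token in seen_tokens:
        return None
    if any(looks_obfuscated(token, w) for w in bad_words):
        return provisional
    return None


if __name__ == "__main__":
    for p in obfuscation_patterns("abcfffg"):
        print(p)
    print(looks_obfuscated("viagrPa", "viagra"))
    print(provisional_spam_probability("v-i-a-g-r-a", set(), ["viagra"]))
```

Note that the unobfuscated word itself also matches, and that two inserted characters in a row (e.g. "viagrPPa") do not, since the separator allows at most one arbitrary character per gap.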