On 14 Dec 2003, Rubin Bennett wrote:

> The spammers are getting smarter about Bayes... this one sneaked through
> SA 2.6, a well trained Bayes database, and the BigEvil rules with a
> score of 1.0 out of 5.

The body of the spam in question was more than 80% Bayes poison, so it's not surprising that it got through. The spamvertised website probably isn't in the BigEvil rules yet: those are generated manually, so there's a delay between a spammer registering a new site and that site being added to BE.

I've got a prototype eval() that looks at the incidence of what I tentatively call "smallwords" (the "glue" words like "is", "a", "the", "and", etc. that hold the English language together) and flags a message if the percentage of such words is too low. It's only moderately useful to date (it needs a lot more tuning), and it only works for English-language messages. In this message the proportion of "smallwords" was an astonishing 0.21 percent; normal English seems to run around 3.5 percent. If nothing else, it seems to be a decent test for Bayes poison.

> What to do? I'm sure that the sleazebags that come up with
> these will send many of them now that they've figured out
> it works. It's starting to look like we're going to have to teach SA
> how to read full sentences and/or paragraphs, and making sure that it
> can pick out when someone dumps a random collection of hammy words into
> a message...

There's also the issue that the message was bad (and I mean *really* bad) HTML. Spammers don't seem to grok that there's a published standard for HTML, and that HTML tags actually have a *purpose* other than obfuscating and hiding words designed to foil spam filters. Perhaps if browser designers did a better job of *showing* bad HTML rather than *hiding* it, we'd be better off....

Obviously, running every incoming message through an HTML validator is cycle-prohibitive, so we need something else.... I just took a look at the distribution of HTML tag lengths by starting letter and found some potentially useful data.

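For anyone who wants to reproduce that kind of tabulation, here is a minimal sketch in Python. It assumes the HTML 4.01 element list is the reference (the survey above does not say which HTML version was used), and the names MAX_LEN and overlong_tags are made up for illustration; this is not SpamAssassin code.

```python
import re
import string

# Element names from the HTML 4.01 index of elements (assumed reference list).
HTML401_TAGS = """
a abbr acronym address applet area b base basefont bdo big blockquote body
br button caption center cite code col colgroup dd del dfn dir div dl dt em
fieldset font form frame frameset h1 h2 h3 h4 h5 h6 head hr html i iframe
img input ins isindex kbd label legend li link map menu meta noframes
noscript object ol optgroup option p param pre q s samp script select small
span strike strong style sub sup table tbody td textarea tfoot th thead
title tr tt u ul var
""".split()

# Longest real tag name for each starting letter.
MAX_LEN = {}
for tag in HTML401_TAGS:
    MAX_LEN[tag[0]] = max(MAX_LEN.get(tag[0], 0), len(tag))

# Letters that start no tag at all in the assumed list.
UNUSED_LETTERS = sorted(set(string.ascii_lowercase) - set(MAX_LEN))

# Grabs the name that follows "<" or "</"; attributes are ignored.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")

def overlong_tags(raw_html):
    """Return tag names longer than any real tag with the same first letter."""
    found = []
    for name in TAG_RE.findall(raw_html):
        name = name.lower()
        if len(name) > MAX_LEN.get(name[0], 0):
            found.append(name)
    return found

if __name__ == "__main__":
    for letter in string.ascii_lowercase:
        print(letter.upper(), MAX_LEN.get(letter, 0))
```

Note that this is a length-only check: a made-up tag such as "<bogus>" still passes because real "B" tags run up to ten characters. It is meant as a cheap first filter, not a validator.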
The longest HTML tag is "BLOCKQUOTE", which rings in at ten characters, so any tag (or bogus attempt at a tag, if you get my drift) that begins with the letter "B" can be up to 10 characters long. That's a heck of a lot of potential words. However, tags that begin with "D", "K", and "V" are at most 3 characters long! But wait, it gets better! Tags beginning with "E" and "U" are at most 2 characters long ("EM", "UL"), "Q" is only 1, and tags beginning with "G", "J", "R", "W", "X", "Y", and "Z" don't exist at all!

With the above, it should be possible to come up with a small number of extra tests that would find Bayes poison embedded in bogus HTML markup brackets. (They would probably have to be meta tests, to make sure one is analysing an HTML, or HTML-wannabe, message.) E.g.:

    rawbody   CRF_BOGUSHTML  /<\/?r.{0,10}>/i
    describe  CRF_BOGUSHTML  Contains an unrecognised HTML tag
    score     CRF_BOGUSHTML  some.score

The above would nail, straight away, the bogus "<residue>" tag in your spam.

We may not be able to make them stop spamming us, but at least we can encourage them to use real HTML....

+------------------------------------------------+---------------------+
| Carl Richard Friend (UNIX Sysadmin)            | West Boylston       |
| Minicomputer Collector / Enthusiast            | Massachusetts, USA  |
| mailto:[EMAIL PROTECTED]                       +---------------------+
| http://users.rcn.com/crfriend/museum           | ICBM: 42:22N 71:47W |
+------------------------------------------------+---------------------+

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
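
A rough Python sketch of the "smallwords" check described earlier in this message, for experimentation only. It is not the prototype eval() itself: the word list contains just the four words the post names, and the threshold and minimum word count are placeholder assumptions that would need tuning against real mail.

```python
import re

# Hypothetical glue-word list; the post names only these four examples.
SMALLWORDS = {"is", "a", "the", "and"}

def smallword_percentage(text):
    """Percentage of words in `text` that appear in SMALLWORDS."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in SMALLWORDS)
    return 100.0 * hits / len(words)

def looks_like_bayes_poison(text, threshold=1.0, min_words=20):
    """Flag text whose glue-word percentage falls below `threshold`.

    Both `threshold` and `min_words` are assumed values. Very short
    texts are never flagged, since the percentage is too noisy there.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < min_words:
        return False
    return smallword_percentage(text) < threshold
```

The percentage depends heavily on which words count as "smallwords" and on whether markup is stripped first, so figures from this sketch are not comparable with the 0.21 and 3.5 percent quoted above.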