On 14 Dec 2003, Rubin Bennett wrote:

> The spammers are getting smarter about Bayes... this one sneaked through
> SA 2.6, a well trained Bayes database, and the BigEvil rules with a
> score of 1.0 out of 5.

   The body of the spam in question was more than 80% Bayes
poison, so it's not surprising that it got through.  The
spamvertised website probably isn't in the BigEvil rules yet
(since those are generated manually, there's a delay between a
spammer registering a new site and its being added to BE).

   I've got a prototype eval() that looks at the incidence of
what I tentatively call "smallwords" (the "glue" words like "is",
"a", "the", "and", &c.) that hold the English language together
and flags a message if the percentage of such words is too
low.  It's only moderately useful to date (needs a lot more
tuning) and it's only useful for English-language messages.
In the case of this message, the proportion of "smallwords" was
an astonishing 0.21 percent; normal English seems to run
around 3.5 percent.  If nothing else, it seems to be a decent
test for Bayes poison.
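
   For illustration only, here is a minimal Python sketch of the
"smallwords" idea.  It is not the prototype eval() itself: the word
list beyond "is", "a", "the" and "and", the tokenizer, and the names
SMALLWORDS, smallword_percent() and too_few_smallwords() are all
assumptions, so the percentages it reports will not line up with the
figures quoted above.  The threshold is left to the caller because
it still needs tuning.

```python
import re

# Hypothetical glue-word list: the post names only "is", "a", "the" and
# "and"; the remaining entries are guesses added for illustration.
SMALLWORDS = frozenset(
    ["is", "a", "the", "and", "of", "to", "in", "it", "or", "an"]
)


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def smallword_percent(text):
    """Percentage of words in text that appear in SMALLWORDS."""
    words = tokenize(text)
    if not words:
        return 0.0  # no words at all; the caller decides what that means
    hits = sum(1 for word in words if word in SMALLWORDS)
    return 100.0 * hits / len(words)


def too_few_smallwords(text, threshold_percent):
    """True if the glue-word share falls below the caller's threshold."""
    return smallword_percent(text) < threshold_percent
```

   A run of random dictionary words contains none of the glue words
and scores zero, which is the property the test relies on.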

> What to do?  I'm sure that the sleazebags that come up with
> these will send many of them now that they've figured out
> it works.  It's starting to look like we're going to have to teach SA
> how to read full sentences and/ or paragraphs, and making sure that it
> can pick out when someone dumps a random collection of hammy words into
> a message...

   There's also the issue here that the message was bad (and
I mean *really* bad) HTML.  Spammers don't seem to grok the
fact that there's a published standard for HTML, and that HTML
tags actually have a *purpose* other than to obfuscate and hide
words designed to foil spam-filters.  Perhaps if browser-
designers would do a better job of *showing* bad HTML rather
than *hiding* it, we might be better off....

   Obviously, running every incoming message through an HTML-
validator is cycle-prohibitive, so we need something else....

   I just took a look at the distribution of HTML tag-lengths
by starting letter and found some potentially useful data.  The
longest HTML tag is "BLOCKQUOTE" which rings in at ten characters,
so any tag (or bogus attempt at a tag, if you get my drift) that
begins with the letter "B" can be up to 10 characters long.  That's
a heck of a lot of potential words.  However, tags that begin with
"D", "E", "K", "U", and "V" are at most 3 characters long!  But
wait, it gets better!  "E" and "U" begin tags that are, at most, 2
characters long ("EM" and "UL"); "Q" only 1; and tags beginning
with "G", "J", "R", "W", "X", "Y", and "Z" don't exist at all!
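
   The tabulation is easy to reproduce.  The sketch below assumes
the element list from the HTML 4.01 element index (the post doesn't
say which version of the standard was counted), and the helper name
max_len_by_letter() is made up for the example.

```python
import string

# Element names from the HTML 4.01 element index. Assumption: this is the
# standard being tabulated; non-standard browser extensions are left out.
HTML401_TAGS = """
a abbr acronym address applet area b base basefont bdo big blockquote body
br button caption center cite code col colgroup dd del dfn dir div dl dt em
fieldset font form frame frameset h1 h2 h3 h4 h5 h6 head hr html i iframe
img input ins isindex kbd label legend li link map menu meta noframes
noscript object ol optgroup option p param pre q s samp script select small
span strike strong style sub sup table tbody td textarea tfoot th thead
title tr tt u ul var
""".split()


def max_len_by_letter(tags):
    """Map each starting letter to the length of its longest tag name."""
    longest = {}
    for tag in tags:
        first = tag[0]
        longest[first] = max(longest.get(first, 0), len(tag))
    return longest


longest = max_len_by_letter(HTML401_TAGS)
# Letters that start no tag at all in the list above.
unused = [c for c in string.ascii_lowercase if c not in longest]

for letter in sorted(longest):
    print(letter.upper(), longest[letter])
print("no tags:", " ".join(c.upper() for c in unused))
```

   Any letter in the "no tags" line is a candidate for a rule of its
own, and the short maxima give an upper bound for the others.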

   With the above, it should be possible to come up with a small
number of extra tests (would probably have to be meta tests to
make sure one is analysing an HTML (or HTML wannabe) message)
that would find Bayes-poison embedded in bogus HTML markup brackets.
E.g.:

rawbody CRF_BOGUSHTML   /<\/?r.{0,10}>/i
describe CRF_BOGUSHTML  Contains an unrecognised HTML tag
score CRF_BOGUSHTML     some.score

   The above would nail, straight away, the bogus "<residue>"
tag in your spam.
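
   To sanity-check the pattern outside SpamAssassin, it can be
transcribed into Python's re module; for a pattern this simple the
Perl and Python behaviour should be the same.  The second pattern,
covering all seven letters that start no tag, is my own hypothetical
extension and not a tested rule, and the function names are invented
for the example.

```python
import re

# The CRF_BOGUSHTML pattern, transcribed from the rule above.
BOGUS_R_TAG = re.compile(r"</?r.{0,10}>", re.IGNORECASE)

# Hypothetical extension: any of the letters that begin no HTML tag.
# [^>] keeps the match from running past the closing bracket.
BOGUS_ANY_TAG = re.compile(r"</?[gjrwxyz][^>]{0,10}>", re.IGNORECASE)


def has_bogus_r_tag(line):
    """True if the line contains something shaped like <r...> or </r...>."""
    return BOGUS_R_TAG.search(line) is not None


def has_bogus_tag(line):
    """True if the line contains a tag starting with G, J, R, W, X, Y or Z."""
    return BOGUS_ANY_TAG.search(line) is not None


for sample in ["<residue>", "</Residue>", "<b>bold</b>", "<zebra>"]:
    print(sample, has_bogus_r_tag(sample), has_bogus_tag(sample))
```

   Note that the {0,10} limit means a bogus "tag" with more than ten
characters after the first letter slips past; the limit can be
raised if that turns out to matter.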

   We may not be able to make them stop spamming us, but at least we
can encourage them to use real HTML....

+------------------------------------------------+---------------------+
| Carl Richard Friend (UNIX Sysadmin)            | West Boylston       |
| Minicomputer Collector / Enthusiast            | Massachusetts, USA  |
| mailto:[EMAIL PROTECTED]                        +---------------------+
| http://users.rcn.com/crfriend/museum           | ICBM: 42:22N 71:47W |
+------------------------------------------------+---------------------+


