On Sun, 14 Dec 2003 14:23:21 -0500 (EST), "Carl R. Friend" <[EMAIL PROTECTED]> writes:

>    I've got a prototype eval() that looks at the incidence of
> what I tentatively call "smallwords" (the "glue" words like "is",
> "a", "the", "and", &c.) that hold the English language together
> and flags a message if the percentage of such words is too
> low.  It's only moderately useful to date (needs a lot more
> tuning) and it's only useful for English-language messages.
> In the case of this message, the number of "smallwords" was
> an astonishing 0.21 percent; normal English seems to run
> around 3.5 percent.  If nothing else, it seems to be a decent
> test for Bayes poison.

Only until they start building text that models English more accurately.
Every model used to check a spam for 'is it English' can be reused as a
model capable of generating fake text that passes the same test. For
instance, if a tester uses a monogram language model, as you suggest,
then I take your exact model and turn it into a generator that emits
exactly the right number of 'is', 'a', and 'the' to fool that test.
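To make that concrete, here's a rough sketch (the word lists are made up for illustration) of a generator tuned to hit whatever "smallword" percentage the monogram test expects:

```python
import random

# Hypothetical lists for illustration; a real spammer would use larger ones.
SMALLWORDS = ["is", "a", "the", "and", "of", "to", "in", "it"]
FILLER = ["mortgage", "refinance", "winner", "offer", "rates", "approved"]

def generate(n_words, smallword_ratio=0.035, seed=0):
    """Emit n_words tokens whose smallword incidence matches the target
    ratio, which is all a monogram frequency test can measure."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        if rng.random() < smallword_ratio:
            words.append(rng.choice(SMALLWORDS))
        else:
            words.append(rng.choice(FILLER))
    return " ".join(words)

def smallword_pct(text):
    toks = text.split()
    return 100.0 * sum(t in SMALLWORDS for t in toks) / len(toks)

fake = generate(10000)
print(round(smallword_pct(fake), 2))  # lands near the "normal English" 3.5
```

The generated text is gibberish, but the monogram statistic can't tell.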

Actually, they could do much better: a bigram word language model, or
even a generative grammar, could make very realistic sentences. All it
takes is Moby Dick, OCaml, and a weekend.
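The bigram version is only a few lines; a sketch (the inline corpus is a tiny stand-in for a real one like Moby Dick):

```python
import random
from collections import defaultdict

# Tiny stand-in corpus; a spammer would train on a whole novel.
CORPUS = ("the quick brown fox jumps over the lazy dog and the dog "
          "barks at the fox while the fox runs over the hill").split()

def train_bigrams(words):
    """Map each word to the list of words observed to follow it."""
    model = defaultdict(list)
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return model

def generate(model, start, n, seed=1):
    """Walk the bigram chain: every adjacent pair in the output is a
    pair that actually occurred in the training text."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n - 1):
        followers = model.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

model = train_bigrams(CORPUS)
print(generate(model, "the", 15))
```

Every two-word window of the output is real English, so any bigram-level test trained on the same kind of text will wave it through.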

>    There's also the issue here that the message was bad (and
> I mean *really* bad) HTML.  Spammers don't seem to grok the
> fact that there's a published standard for HTML, and that HTML
> tags actually have *purpose* other than to obfuscate and hide
> words designed to foil spam-filters.  Perhaps if browser-
> designers would do a better job of not *hiding* bad HTML, but
> rather *showing* it, we might be better off....

Definitely! I'd suggest perhaps having a learning model that learns HTML
tag bodies. Once it's seen, say, 100 nonspam HTML emails, it then
considers all new HTML tags as bad.
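A minimal sketch of that learner (class and threshold names are mine; this only learns tag names, not full tag bodies):

```python
import re

# Grabs the tag name from an opening or closing tag.
TAG_RE = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)")

class TagLearner:
    """Learn tag names from known-ham HTML mail; once enough ham has
    been seen, treat any unseen tag name as suspicious."""
    def __init__(self, min_ham=100):
        self.min_ham = min_ham
        self.seen_tags = set()
        self.ham_count = 0

    def train_ham(self, html):
        self.ham_count += 1
        self.seen_tags.update(t.lower() for t in TAG_RE.findall(html))

    def suspicious_tags(self, html):
        if self.ham_count < self.min_ham:
            return set()          # not enough training data; stay neutral
        tags = {t.lower() for t in TAG_RE.findall(html)}
        return tags - self.seen_tags

learner = TagLearner(min_ham=2)
learner.train_ham("<html><body><p>hi</p></body></html>")
learner.train_ham("<html><body><b>hello</b></body></html>")
print(learner.suspicious_tags("<body>No<residue>w</residue></body>"))  # → {'residue'}
```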

We could also use this sort of scheme for Bayes. As the amount of
training data increases, we consider any new token to be more spammy.
I.e., at 200 trained emails, a new token is neutral. At 1,000, it's 70%
spam. At 10,000, it's considered 95% spam.
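One way to realize that schedule (the anchor numbers are from above; the log-linear interpolation between them is just my assumption):

```python
import math

def new_token_spamminess(n_trained):
    """Spam probability assigned to a never-before-seen token, as a
    function of how many emails the Bayes DB has been trained on.
    Anchor points: neutral at 200, 0.70 at 1,000, 0.95 at 10,000;
    log-linear interpolation in between is an arbitrary choice."""
    points = [(200, 0.50), (1000, 0.70), (10000, 0.95)]
    if n_trained <= points[0][0]:
        return points[0][1]
    if n_trained >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= n_trained <= x1:
            t = (math.log(n_trained) - math.log(x0)) / (math.log(x1) - math.log(x0))
            return y0 + t * (y1 - y0)

for n in (100, 200, 1000, 10000, 50000):
    print(n, round(new_token_spamminess(n), 3))
```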

>    Obviously, running any incoming message through an HTML-
> validator is cycle-prohibitive, so we need something else....

An approximation that just checks that each tag name is valid should be
more than fast enough. It wouldn't catch bad nesting structure, but it
should be pretty reliable at punishing random nonsense within a tag.
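Something like this, say (the whitelist here is deliberately partial; a real check would use the full element list from the HTML 4 spec):

```python
import re

# Partial whitelist of legitimate HTML tag names, for illustration only.
VALID_TAGS = {"html", "head", "body", "p", "br", "b", "i", "u", "a",
              "font", "img", "table", "tr", "td", "div", "span"}

TAG_RE = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)")

def bogus_tags(html):
    """Return the tag names that aren't valid HTML. One regex pass,
    no parsing: cheap, blind to nesting errors, but it nails the
    random-word tags spammers use to break up trigger phrases."""
    names = [t.lower() for t in TAG_RE.findall(html)]
    return [t for t in names if t not in VALID_TAGS]

snippet = '<font size="+2">No<residue>w y</clutter>ou c<azimuth>an ha</thiamin>ve'
print(bogus_tags(snippet))  # → ['residue', 'clutter', 'azimuth', 'thiamin']
```

Run against the sample spam below, it lights up on nearly every tag.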

My rough idea is that we don't necessarily have to detect obfuscation
and Bayes poison; it's enough if we can just keep the spammers from
hiding it from the customers. Once the email turns into line noise like:

   <font size=3D"+2">No<residue>w y</clutter>ou=20 c<azimuth>an
   ha</thiamin>ve=20 HU<sauerkraut>NDR</chronicle>ED<danzig>S of=20
   le</annie>nd<seamen>ers co</opponent>mpete=20 f<rangeland>or
   y</blood>ou<dime>r lo</bocklogged>an!  <BR><BR>R<kulak>ATE</level>S
   A<lectionary>S LO</mcgrath>W=20

then the spammer DEFINITELY isn't gonna get any sales.

Scott


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
