On Wed, 27 Aug 2003 16:59:42 +0200 (CEST)
Morten Kjeldgaard <[EMAIL PROTECTED]> wrote:

> I've noticed recently that spam is starting to creep through the cracks of
> SA. I looked at one of them, and it seems this spammer is trying to foil
> the spam-filters, especially Bayes.
> 
> The spam message contains a large number of dictionary words written in
> HTML in WHITE, so they don't appear on the spam-message itself, plus of
> course the usual spam attachments with links and gifs and stuff. Here is a
> sample of the words that appeared in the message:
> 

I'm new to the list, so if this has already been hashed to death, I
apologize. The problem you describe is exactly what motivated me to join the
list. I think I may have figured out a way to easily identify such spam, but
I'm still working on the details. If someone would care to point me to an
FAQ detailing how to add a simple test that evaluates the body of a message,
sans attachments, via an external program, that would help.
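
To make the request concrete, here is a minimal sketch in Python (not a
SpamAssassin plugin) of what "evaluate the body, sans attachments, via an
external program" could look like. The function names body_text and
run_external are made up for illustration, and the sketch assumes the
external program reads its input on stdin.

```python
import subprocess
from email import message_from_string
from email.message import Message


def body_text(msg: Message) -> str:
    """Concatenate the text/plain parts that are not marked as attachments."""
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        if part.get_content_type() != "text/plain":
            continue  # skip HTML, images and other non-plain parts
        if part.get_content_disposition() == "attachment":
            continue  # skip attached text files
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name: fall back to a lossy UTF-8 decode.
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def run_external(command: list, text: str) -> str:
    """Pipe text to an external program on stdin and return its stdout."""
    result = subprocess.run(
        command, input=text, capture_output=True, text=True, check=False
    )
    return result.stdout
```

Something like run_external(["style"], body_text(message_from_string(raw)))
would then hand the bare body to style, assuming style is installed and
reads stdin when no file name is given.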

What I think may work is to use something like Michael Haardt's
diction and style programs,

http://www.gnu.org/software/diction/diction.html

to analyse the structure of the body of the message. For instance, your
message looks like this to style:

readability grades:
        Kincaid: 7.0
        ARI: 8.1
        Coleman-Liau: 8.9
        Flesch Index: 79.6
        Fog Index: 9.5
        Lix: 34.4 = school year 5
        SMOG-Grading: 8.0
sentence info:
        902 characters
        215 words, average length 4.20 characters = 1.27 syllables
        11 sentences, average length 19.5 words
        54% (6) short sentences (at most 15 words)
        27% (3) long sentences (at least 30 words)
        7 paragraphs, average length 1.6 sentences
        9% (1) questions
        72% (8) passive sentences
        longest sent 52 wds at sent 11; shortest sent 3 wds at sent 8
word usage:
        verb types:
        to be (10) auxiliary (2)
        types as % of total:
        conjunctions 3(6) pronouns 8(18) prepositions 11(23)
        nominalizations 0(1)
sentence beginnings:
        pronoun (3) interrogative pronoun (1) article (2)
        subordinating conjunction (0) conjunction (1) preposition (0)


where a typical "bayesian dodging" message looks more like this:


readability grades:
        Kincaid: 35.6
        ARI: 39.0
        Coleman-Liau: -2.7
        Flesch Index: 18.2
        Fog Index: 40.0
        Lix: 102.0 = higher than school year 11
        SMOG-Grading: 3.0
sentence info:
        445 characters
        200 words, average length 2.23 characters = 1.03 syllables
        2 sentences, average length 100.0 words
        50% (1) short sentences (at most 95 words)
        50% (1) long sentences (at least 110 words)
        1 paragraphs, average length 2.0 sentences
        0% (0) questions
        0% (0) passive sentences
        longest sent 166 wds at sent 2; shortest sent 34 wds at sent 1
word usage:
        verb types:
        to be (0) auxiliary (0) 
        types as % of total:
        conjunctions 0(1) pronouns 1(2) prepositions 0(1)
        nominalizations 0(0)
sentence beginnings:
        pronoun (0) interrogative pronoun (0) article (0)
        subordinating conjunction (0) conjunction (0) preposition (0)


Actually, I had to add one valid sentence to the bottom of the body to even
get this result, because without it, style thought the message was
structureless.
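
The sentence-length gap between the two reports can be approximated without
style at all. The following is a rough sketch in Python, not style's actual
algorithm: it splits on ., ! and ? followed by whitespace and counts
alphabetic words. The name sentence_stats is made up for illustration.

```python
import re


def sentence_stats(text: str) -> dict:
    """Return simple sentence-length statistics for a block of text."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Count alphabetic words per sentence, ignoring fragments with no words.
    counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    counts = [n for n in counts if n > 0]
    if not counts:
        return {"sentences": 0, "words": 0, "avg_sentence_length": 0.0, "longest": 0}
    total = sum(counts)
    return {
        "sentences": len(counts),
        "words": total,
        "avg_sentence_length": total / len(counts),
        "longest": max(counts),
    }
```

A run of dictionary words with no punctuation comes out as one very long
sentence, which is the same signal as the 100-word average above.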


The obvious problems with this sort of approach are: 

1. Not all languages would be supported, so some folks would want to
disable such a test. style only understands German and English.

2. Determining exactly which metrics, or combination of metrics, to use,
e.g. Kincaid.

3. Certain types of technical email might generate a false spam result. 


Although sentence length is a clear indicator today, spammers could
easily work around it by adding random punctuation as well. Still, I think
that by using style, or maybe ispell, to add some 2-point increments to a
spam score, one could begin to catch this randomly padded email.
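
As a sketch of the "2-point increments" idea, the Python below reads two
numbers out of a style report in the format pasted above and adds 2 points
for each one that looks abnormal. The thresholds are guesses that merely
separate the two sample reports; they are not tuned values, and
score_adjustment is a made-up name.

```python
import re

# Patterns match lines of the style report as shown in this message.
KINCAID_RE = re.compile(r"Kincaid:\s*(-?\d+(?:\.\d+)?)")
AVG_SENTENCE_RE = re.compile(
    r"sentences, average length\s*(\d+(?:\.\d+)?)\s*words"
)

# Hypothetical thresholds, chosen only to split the two examples.
KINCAID_LIMIT = 20.0
AVG_SENTENCE_LIMIT = 50.0
POINTS_PER_HIT = 2.0


def score_adjustment(style_report: str) -> float:
    """Return the points to add to a spam score for one style report."""
    points = 0.0
    kincaid = KINCAID_RE.search(style_report)
    if kincaid and float(kincaid.group(1)) > KINCAID_LIMIT:
        points += POINTS_PER_HIT
    avg = AVG_SENTENCE_RE.search(style_report)
    if avg and float(avg.group(1)) > AVG_SENTENCE_LIMIT:
        points += POINTS_PER_HIT
    return points
```

With these limits the first report above would add nothing and the padded
one would add 4 points. A report with no recognizable lines adds nothing,
so a message in an unsupported language is not penalized by accident.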

On a somewhat related note, as someone who calls spamassassin from procmail,
it would be swell if you could pass a spamminess adjustment to spamassassin
and spamc from the command line. That way, if the above were employed, you
could easily add an adjustment if you knew, for example, that the mail
message came from a foreign-language or technical newsgroup. Sure, you
could do it some other way, but that would involve much more complex
procmail rules.

Anyway, I hope this gives someone an idea. I'm not a perl programmer, so if
anyone thinks this idea has merit, please feel free to run with it, because
I'm sure someone with more experience could implement something faster than
I could.

-- 
Harry Waddell
Caravan Electronic Publishing
-----------

P.S. With the signature included, this message scores Kincaid = 5.2.




_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
