On Thu, 06 Nov 2003 10:58:59 -0800, Greg Webster <[EMAIL PROTECTED]>
posted to spamassassin-talk:
 > A thought on spammers oft-used sets of 'random' character lists in
 > emails...an example:
 >
 > gnqplleqhzblll
 > u
 >  wfjmvfe upvxoi lwhm
 > xqs 
 > flckwrtsmufx irwajksqsnw er wcfjgfmk jugxfq

Have you looked into analyzing these using the language recognizer
which is included with SpamAssassin? I posted about this a while ago
(a thread with "consonants" in the title IIRC) and there were other
messages in that thread which specifically looked at regex-based
approaches to this.

 > Some potential concerns:
 > - Encoded messages will likely set this off (uuencode, binhex, etc.)

I believe you could recognize those with the language recognizer as well. 

 > - Are there many legitimate situations where 5+ consonents will be seen?

As somebody already posted, there are words even in a regular English
dictionary which have more than five consecutive consonants. Also
consider people who have reason to write "aarrrrrrrrgggggggghhhhhhhh"
(maybe because they use Windows, but forgive them, for they know not
what they are doing :^) or acronyms like DSSSL.
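
To make the false-positive problem concrete, here is a minimal sketch of the "5+ consonants" test. It is not an actual SpamAssassin rule. The character class is my assumption (y is treated as a vowel), and consonant_runs is a made-up helper name.

```python
import re

# Assumption: a "consonant" is any ASCII letter other than a, e, i, o, u, y.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}", re.IGNORECASE)

def consonant_runs(text):
    """Return every run of five or more consecutive consonants in text."""
    return CONSONANT_RUN.findall(text)

# The spammer gunk quoted above, plus two legitimate strings from this message.
for sample in ("gnqplleqhzblll", "aarrrrrrrrgggggggghhhhhhhh", "DSSSL"):
    print(sample, consonant_runs(sample))
```

All three samples produce at least one hit, so a rule like this cannot tell the gunk from the legitimate cases on its own.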

 > - Will other languages (such as German and Welsh with long strings of
 > consonents) be penalized for using this?
         ^
(That'd be "consonants".) Perhaps a meta rule which pairs the language
code with the rule to use would be in order. I don't know if that's
readily doable with the current SpamAssassin, though.

In English, compounds like "Knightsbridge" (as opposed to "Knight's
Bridge") are rather the exception, but many languages tend to write
compounds together, as a rule. So if a word which ends in a consonant
cluster is compounded with a word which begins with a consonant
cluster, you can get fairly long sequences of consonants. And then in
some languages, even consonants can be syllable cores, so a word might
not have a vowel at all (Slavic languages are known for this -- for
kicks, see the Czech tongue twister at
<http://www.uebersetzung.at/twister/> ... also have a look at Swedish
#28 with eight consonants in a row). I believe this is even quite
normal in German and Dutch :-)

Rather than focus on individual hits, I'd look for sequences where the
distribution of letter pairs differs significantly from any known
language. A cheap but inexact solution is the language identifier in
SA (actually borrowed from a separate effort; it's called TextCat) but
you could also develop more refined techniques for this particular
application.
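
As a rough illustration of the letter-pair idea, the sketch below counts how many of a sample's letter pairs never occur in a reference text. This is much cruder than a real statistical comparison and it is not how TextCat works. The reference sentence and the helper names (bigrams, unseen_fraction) are stand-ins I made up; a real model would need a proper corpus for each language.

```python
from collections import Counter

def bigrams(text):
    """Letter pairs inside each word, lowercased; non-letters are dropped."""
    pairs = []
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        pairs.extend(letters[i:i + 2] for i in range(len(letters) - 1))
    return pairs

def unseen_fraction(sample, reference_counts):
    """Fraction of the sample's letter pairs never seen in the reference."""
    pairs = bigrams(sample)
    if not pairs:
        return 0.0
    unseen = sum(1 for pair in pairs if pair not in reference_counts)
    return unseen / len(pairs)

# Hypothetical stand-in for a real corpus; far too small for actual use.
REFERENCE = Counter(bigrams(
    "the quick brown fox jumps over the lazy dog and then "
    "there were other messages in that thread about language"
))

print(unseen_fraction("the other thread", REFERENCE))
print(unseen_fraction("gnqplleqhzblll", REFERENCE))
```

A fraction near zero means the sample looks like the reference language, and a fraction near one means it looks like gunk. Where to draw the line is something you would have to tune against real mail.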

(Incidentally, I've found that the language models which are borrowed
from TextCat into SpamAssassin aren't exactly very precise. For one
thing, long sequences of gunk tend to throw off the analyzer. For
example, the string "...................." gets classified as Russian
... and the empty string gets classified as Dutch!)

Let's say you have statistics which tell you which sequences are
common in those languages you care to receive e-mail in. So you start
out with a score of, say, ten, and deduct one point for each unusual
sequence. If you look at the sequence "gnqplleqhzblll" above, and
split it up into trigrams (sequences of three characters each, with a
sliding window moving over the string one character at a time), you'd
get overlapping unusual sequences all the way (maybe deduct more
when they're overlapping or adjacent, even) except at "lle"; "leq"
and "qpl" come close to "leg" and "gpl" (think GNU General Public
License :-) but still count as unusual, so this sequence alone would
trip the score down below zero.
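
The arithmetic can be sketched as follows. The COMMON set is purely hypothetical (I pretend the statistics list only "lle" as a common trigram), and the function names are mine.

```python
def trigrams(text):
    """Overlapping three-character windows, moved one character at a time."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def score(text, common, start=10):
    """Start from `start` and deduct one point per trigram not in `common`."""
    return start - sum(1 for tri in trigrams(text) if tri not in common)

# Hypothetical: pretend our statistics list only "lle" as a common trigram.
COMMON = {"lle"}

print(trigrams("gnqplleqhzblll"))
print(score("gnqplleqhzblll", COMMON))
```

The fourteen-character string yields twelve trigrams. With only one of them counted as common, eleven points come off the starting ten.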

(Real scoring systems tend to add scores for "good" and deduct for
"bad" and have a gray area where you don't change the score because
you don't know which way the case should be handled. This is in
principle fairly similar to how the Bayesian classifier works, BTW.)
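
A minimal sketch of that three-way scheme, with made-up GOOD and BAD sets standing in for real statistics (this is not the Bayesian classifier itself, just the same add/deduct/ignore shape):

```python
def graded_score(text, good, bad):
    """Add a point per trigram in `good`, deduct one per trigram in `bad`;
    trigrams in neither set are the gray area and leave the score alone."""
    total = 0
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        if tri in good:
            total += 1
        elif tri in bad:
            total -= 1
    return total

# Hypothetical toy sets; real ones would come from corpus statistics.
GOOD = {"the", "ing", "lle"}
BAD = {"gnq", "qhz", "hzb"}

print(graded_score("gnqplleqhzblll", GOOD, BAD))
```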

/* era */

-- 
The email address era     the contact information   Just for kicks, imagine
at iki dot fi is heavily  link on my home page at   what it's like to get
spam filtered.  If you    <http://www.iki.fi/era/>  500 pieces of spam for
want to reach me, see     instead.                  each wanted message.


