On Thu, 06 Nov 2003 10:58:59 -0800, Greg Webster <[EMAIL PROTECTED]>
posted to spamassassin-talk:
> A thought on spammers oft-used sets of 'random' character lists in
> emails...an example:
>
> gnqplleqhzblll
> u
> wfjmvfe upvxoi lwhm
> xqs
> flckwrtsmufx irwajksqsnw er wcfjgfmk jugxfq

Have you looked into analyzing these using the language recognizer
which is included with SpamAssassin?  I posted about this a while ago
(a thread with "consonants" in the title, IIRC), and other messages in
that thread looked specifically at regex-based approaches to this.

> Some potential concerns:
> - Encoded messages will likely set this off (uuencode, binhex, etc.)

I believe you could recognize those with the language recognizer as
well.

> - Are there many legitimate situations where 5+ consonents will be seen?

As somebody already posted, there are words even in a regular English
dictionary which have more than five consecutive consonants.  Also
consider people who have reason to write "aarrrrrrrrgggggggghhhhhhhh"
(maybe because they use Windows, but forgive them, for they know not
what they are doing :^) or acronyms like DSSSL.

> - Will other languages (such as German and Welsh with long strings of
> consonents) be penalized for using this?
        ^
(That'd be "consonants".)

Perhaps a meta rule which pairs the language code with the rule to use
would be in order.  I don't know if that's readily doable with the
current SpamAssassin, though.

In English, compounds like "Knightsbridge" (rather than "Knight's
Bridge") are the exception, but many languages write compounds as a
single word, as a rule.  So if a word which ends in a consonant cluster
is compounded with a word which begins with a consonant cluster, you
can get fairly long sequences of consonants.

And then in some languages, even consonants can be syllable cores, so a
word might not have a vowel at all.  (Slavic languages are known for
this; for kicks, see the Czech tongue twister at
<http://www.uebersetzung.at/twister/> ... also have a look at Swedish
#28 with eight consonants in a row.  I believe this is even quite
normal in German and Dutch :-)

Rather than focus on individual hits, I'd look for sequences where the
distribution of letter pairs differs significantly from any known
language.
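
To make the letter-pair idea a bit more concrete, here is a minimal
sketch in Python.  Everything in it is an assumption for illustration:
the function names (bigrams, train, unusual_fraction), the tiny
reference text, and the simplification that a pair counts as "unusual"
only if it never occurs in the reference at all.  A real model would be
built from a large corpus per language and would compare frequencies,
not just presence.

```python
from collections import Counter
import re


def bigrams(text):
    """Yield letter pairs from each word, sliding one character at a time."""
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - 1):
            yield word[i:i + 2]


def train(reference_text):
    """Count how often each letter pair occurs in known-good text."""
    return Counter(bigrams(reference_text))


def unusual_fraction(text, model):
    """Return the share of letter pairs in `text` never seen in the model."""
    pairs = list(bigrams(text))
    if not pairs:
        return 0.0  # nothing to judge, e.g. empty input or single letters
    unseen = sum(1 for pair in pairs if model[pair] == 0)
    return unseen / len(pairs)


# Hypothetical and far too small reference sample; use a real corpus.
REFERENCE = (
    "the quick brown fox jumps over the lazy dog while all the people "
    "in knightsbridge were telling each other about spam filtering rules"
)

model = train(REFERENCE)
gunk = unusual_fraction("gnqplleqhzblll", model)
prose = unusual_fraction("the people were telling the dog about the rules", model)
```

With this toy reference, the gunk string scores much higher than the
ordinary sentence.  Where to put the threshold, and whether to keep one
model per language, would have to be tuned against real mail.
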
A cheap but inexact solution is the language identifier in SA (actually
borrowed from a separate effort; it's called TextCat), but you could
also develop more refined techniques for this particular application.
(Incidentally, I've found that the language models which are borrowed
from TextCat into SpamAssassin aren't very precise.  For one thing,
long sequences of gunk tend to throw off the analyzer.  For example,
the string "...................." gets classified as Russian ... and
the empty string gets classified as Dutch!)

Let's say you have statistics which tell you which sequences are common
in those languages you care to receive e-mail in.  So you start out
with a score of, say, ten, and deduct one point for each unusual
sequence.  If you look at the sequence "gnqplleqhzblll" above, and
split it up into trigrams (sequences of three characters each, with a
sliding window moving over the string one character at a time), you'd
get overlapping unusual sequences all the way (maybe deduct extra if
they're overlapping or adjacent, even) except at "lle" - "leg" and
maybe "gpl" (think GNU General Public Licence :-) and so this sequence
alone would trip the score down below zero.

(Real scoring systems tend to add scores for "good" and deduct for
"bad" and have a gray area where you don't change the score because you
don't know which way the case should be handled.  This is in principle
fairly similar to how the Bayesian classifier works, BTW.)

/* era */

-- 
The email address era at iki dot fi is heavily spam filtered.  If you
want to reach me, see the contact information link on my home page at
<http://www.iki.fi/era/> instead.  Just for kicks, imagine what it's
like to get 500 pieces of spam for each wanted message.