John August said:
> This is just an idea, in the tradition of 'I've got a good idea and
> hope someone else will carry it through'. I don't expect it, but
> thought I'd throw it in :)
>
> I've noticed a lot of spam which tries to dilute scanners by
> including a lot of strings of random characters put together as
> words, or real words strung together.
>
> For example:
>
> ibkcd ngf dfvfjq
>
> While generated randomly, this has some unphonemic bits: 'fv', 'jq',
> etc. Even in randomly generated text, known unphonemic two-letter
> sequences must turn up quite frequently (while you wouldn't actually
> trigger on whole random words such as 'dfvfjq').

Check out the "tripwire" rules. These tend to catch this type of
gibberish quite nicely. See

http://www.merchantsoverseas.com/wwwroot/gorilla/99_FVGT_Tripwire.cf
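
Along the same lines, here is a rough sketch of a rule of my own (not
taken from that file, name and score are placeholders, and I haven't
tested it for false positives) that keys on letter pairs which almost
never occur inside real English words:

body     MY_UNPHONEMIC_PAIR  /\b[a-z]*(?:fv|jq|qx|vq|zx)[a-z]*\b/i
describe MY_UNPHONEMIC_PAIR  word containing an unlikely letter pair
score    MY_UNPHONEMIC_PAIR  0.1

The pair list is just the 'fv' and 'jq' from John's example plus a few
obvious ones (it would hit 'dfvfjq' above). A real rule would want a
longer, tested list and a low score, since odd names and abbreviations
will occasionally match.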

> Presumably, Bayesian approaches might pick these up automatically,
> but an intelligent approach would probably be more efficient,
> perhaps even using ideas about phonemes and how they fit together
> (which I'm not familiar with).
>
> Further, there are spams which try to include a random sequence of
> known words, for example:
>
> dielectric accuse headlight berry onrushing rutherford congeal
> hepburn quadrature oat exegete desolate cushion pancake
>
> This seems to be 'heavy' with nouns and adjectives, with only one
> verb, 'accuse'. It might be possible to identify ungrammatical
> sentences; of course, there's a lot of CPU involved in this, plus
> the need for an accompanying dictionary.
>
> You could perhaps assess semantic distance, too; for example, it
> makes sense to 'cut paper' but not to 'cut envy'; there would be
> some semantic distance between 'headlight' and 'berry' above.
> Again, this would cost a lot of CPU.
>
> I'm not on the list, so please reply to me directly as well.
>
> I've done a search for 'phonemes' and only found some discussion of
> people talking about phoneme substitutions in words that were meant
> to be communicated; I'm talking here about noise identification, and
> there do not seem to be any readily searchable postings of this
> nature.

Here are a couple of rules that showed up on the list (sorry, I forgot
the authors' names) which penalize long runs of longish words without
short prepositions and the like:

body     CP_WORDWORD_10  /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){10}/
describe CP_WORDWORD_10  string of 10+ random words
score    CP_WORDWORD_10  0.5

body     CP_WORDWORD_15  /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){15}/
describe CP_WORDWORD_15  string of 15+ random words
score    CP_WORDWORD_15  2.5

Each rule is three lines (body, describe, score), and each "body" line
must stay on a single line with no wraps; mail clients tend to break
the long regex. I also score them at 1.5 and 3.5 instead, though I
haven't tested for false positives.
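
One variation I've been meaning to try (my own untested sketch, so
treat the rule name and score as placeholders): widen the exclusion
list to more of the short function words that real prose can't avoid,
so legitimate text breaks the run sooner and the score can be pushed a
bit higher. John's observation that his sample is nearly all nouns and
adjectives is roughly what this keys on.

body     MY_WORDWORD_STRICT  /(?:\b(?!(?:from|even|more|were|with|that|this|have|will|your|they|been)\b)[a-z]{4,12}\s+){12}/
describe MY_WORDWORD_STRICT  12+ longish words with no common function words
score    MY_WORDWORD_STRICT  1.0

As with the rules above, make sure the whole "body" line survives as
one line, and run it against a decent ham corpus before trusting the
score.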

-- 
Kurt Yoder
Sport & Health network administrator


