John August said:
> This is just an idea, in the tradition of "I've got a good idea and hope
> someone else will carry it through". I don't expect it, but thought I'd
> throw it in :)
>
> I've noticed a lot of spam which tries to dilute scanners by including a
> lot of strings of random characters put together as words, or real words
> strung together. For example:
>
> ibkcd ngf dfvfjq
>
> While generated randomly, this has some un-phonemic bits: 'fv', 'jq', etc.
> Even though it's generated randomly, known unphonemic two-letter sequences
> must turn up quite frequently (while you wouldn't actually trigger on whole
> random words such as 'dfvfjq').
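For what it's worth, the unphonemic-bigram idea is easy to prototype outside
SpamAssassin. Below is a rough Perl sketch; the list of rare letter pairs is
just a placeholder I made up for illustration (not a real English frequency
table), and I haven't checked it for false positives:

    #!/usr/bin/perl
    # Rough sketch only: count occurrences of letter pairs that rarely
    # appear inside English words. The pair list is illustrative, not
    # derived from real frequency data.
    use strict;
    use warnings;

    my %rare = map { $_ => 1 } qw(fv jq qz vq xj zx wv qk);

    sub gibberish_score {
        my ($text) = @_;
        my $hits = 0;
        for my $word (split /\s+/, lc $text) {
            for my $i (0 .. length($word) - 2) {
                $hits++ if $rare{ substr($word, $i, 2) };
            }
        }
        return $hits;
    }

    # The random string quoted above scores 2 ('fv' and 'jq');
    # ordinary English text should usually score 0.
    print gibberish_score("ibkcd ngf dfvfjq"), "\n";   # 2
    print gibberish_score("cut the paper now"), "\n";  # 0

Something like this could feed a score proportional to the hit count, but it
would need a properly built bigram table before being used for real.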
Check out the "tripwire" rules. These tend to catch this type of gibberish
quite nicely. See
http://www.merchantsoverseas.com/wwwroot/gorilla/99_FVGT_Tripwire.cf

> Presumably, Bayesian approaches might pick these up automatically, but an
> intelligent approach would probably be more efficient. Perhaps even using
> ideas about phonemes and how they fit together (which I'm not familiar
> with).
>
> Further, there are spams which try to include a random sequence of known
> words, for example:
>
> dielectric accuse headlight berry onrushing rutherford congeal hepburn
> quadrature oat exegete desolate cushion pancake
>
> This seems to be 'heavy' with nouns and adjectives, with only one verb,
> 'accuse'. It might be possible to identify ungrammatical sentences; of
> course, there's a lot of CPU involved in this, and an adjunct dictionary.
>
> You could perhaps assess semantic distance, too; for example, it makes
> sense to 'cut paper' but not to 'cut envy'; there would be some semantic
> distance between 'headlight' and 'berry' above. Again, this would cost a
> lot of CPU.
>
> I'm not on the list, please reply to me as well.
>
> I've done a search for 'phonemes', and only found some discussion of
> people talking about phoneme substitutions of words which were meant to be
> communicated; I'm talking here about noise identification, and there do
> not seem to be any postings of this nature readily searched.

Here's a rule that showed up on the list (sorry, I forgot the authors'
names) which penalizes strings of long words without short prepositions,
etc.:

body CP_WORDWORD_10 /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){10}/
describe CP_WORDWORD_10 string of 10+ random words
score CP_WORDWORD_10 0.5

body CP_WORDWORD_15 /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){15}/
describe CP_WORDWORD_15 string of 15+ random words
score CP_WORDWORD_15 2.5

Each rule should be exactly three lines in your .cf file -- body, describe,
score -- with no line wraps in the body pattern. I also score them at 1.5
and 3.5 instead, though I haven't tested for false positives. (A quick
standalone test of the pattern is at the end of this mail.)

-- 
Kurt Yoder
Sport & Health network administrator
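P.S. If you want to sanity-check the CP_WORDWORD pattern outside
SpamAssassin, here's a rough Perl test script. The sample strings are my
own made-up test data, not real mail, and I haven't gone beyond these two
cases:

    #!/usr/bin/perl
    # Quick, standalone check of the CP_WORDWORD_10 pattern above.
    # As written the pattern only matches lowercase letters, so the
    # samples here are lowercase. Note that short words break a run,
    # since only 4-12 letter words count: the quoted spam example has
    # only nine qualifying words in a row before 'oat', so 'oat' is
    # dropped here to get a hit.
    use strict;
    use warnings;

    my $re = qr/(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){10}/;

    my $spammy = "dielectric accuse headlight berry onrushing rutherford "
               . "congeal hepburn quadrature exegete desolate cushion pancake ";
    my $hammy  = "please send me the report from last week so we can go "
               . "over it with the team before the meeting on friday ";

    print "random-word sample: ", ($spammy =~ $re ? "hit" : "miss"), "\n";  # hit
    print "normal text sample: ", ($hammy  =~ $re ? "hit" : "miss"), "\n";  # miss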