This is just an idea, in the tradition of 'I've got a good idea and hope someone else will carry it through'. I don't expect it, but thought I'd throw it in :)
I've noticed a lot of spam which tries to dilute scanners by including a lot of strings of random characters put together as words, or real words strung together. For example:

    ibkcd ngf dfvfjq

Although this is generated randomly, it has some unphonemic bits: 'fv', 'jq' etc. Precisely because the text is random, known unphonemic two-letter sequences must turn up quite frequently (whereas you couldn't usefully trigger on whole random words such as 'dfvfjq', since any particular one is unlikely to turn up again). Presumably, Bayesian approaches might pick these up automatically, but an intelligent approach would probably be more efficient, perhaps even using ideas about phonemes and how they fit together (which I'm not familiar with).

Further, there are spams which try to include a random sequence of known words, for example:

    dielectric accuse headlight berry onrushing rutherford congeal
    hepburn quadrature oat exegete desolate cushion pancake

This seems to be 'heavy' with nouns and adjectives, with very few verbs ('accuse' and 'congeal' are the obvious ones). It might be possible to identify ungrammatical sentences; of course, there's a lot of CPU involved in this, and it would need an accompanying dictionary.

You could perhaps assess semantic distance, too; for example, it makes sense to 'cut paper' but not to 'cut envy'; there would be some semantic distance between 'headlight' and 'berry' above. Again, this would cost a lot of CPU.

I'm not on the list, please reply to me as well. I've done a search for 'phonemes', and only found some discussion of phoneme substitutions in words which were meant to be communicated; I'm talking here about noise identification, and I could not readily find any postings of this nature.

-- 
Democracy is a bitch if you're in the 49%.
John August
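P.S. A rough sketch of the two-letter-sequence idea, in Python, in case it helps whoever picks this up. Everything in it is an assumption made for illustration: the names REFERENCE_TEXT, known_bigrams and unseen_bigram_fraction are made up, the reference text is a toy stand-in for a real corpus of legitimate mail, and treating 'never seen in the reference text' as 'unphonemic' is a crude substitute for real knowledge of phonemes. It is not how SpamAssassin does anything.

```python
import re

# Toy stand-in for a large corpus of ordinary English (assumption: a real
# filter would build this from many legitimate messages).
REFERENCE_TEXT = (
    "the quick brown fox jumps over the lazy dog while generated randomly "
    "this message has some words strung together as real sentences and "
    "people read them every day"
)


def letter_bigrams(word):
    """Return the overlapping two-letter sequences in a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def known_bigrams(text):
    """Collect every letter bigram that occurs inside a word of the text."""
    seen = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        seen.update(letter_bigrams(word))
    return seen


def unseen_bigram_fraction(text, known):
    """Fraction of the text's letter bigrams that never occur in `known`.

    Returns 0.0 when the text contains no bigrams at all.
    """
    total = 0
    unseen = 0
    for word in re.findall(r"[a-z]+", text.lower()):
        for bigram in letter_bigrams(word):
            total += 1
            if bigram not in known:
                unseen += 1
    return unseen / total if total else 0.0
```

With known = known_bigrams(REFERENCE_TEXT), the string 'ibkcd ngf dfvfjq' should score far higher on unseen_bigram_fraction than a sentence of ordinary words. A real rule would need a threshold tuned on actual mail, and a far larger reference corpus, to avoid flagging names, acronyms and non-English text.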
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk