This is just an idea, in the tradition of 'I've got a good idea and hope
someone else will carry it through'. I don't expect that to happen, but
thought I'd throw it in :)
I've noticed a lot of spam which tries to dilute scanners by padding the
message with strings of random characters put together as words, or with
real words strung together at random.

For example:

ibkcd ngf dfvfjq

Although this is generated randomly, it contains some unphonemic bits:
'fv', 'jq', etc. Precisely because the text is random, known unphonemic
two-letter sequences must turn up quite frequently, so you could trigger
on those (whereas you couldn't realistically trigger on whole random words
such as 'dfvfjq').
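
To make the two-letter idea concrete, here is a rough sketch in Python (not
SpamAssassin rule code). The RARE_BIGRAMS set is a hand-picked guess of mine
for illustration; a real rule would presumably derive it from letter-pair
counts over a large body of legitimate mail.

```python
import re

# Hypothetical, hand-picked letter pairs assumed to be rare inside English
# words. A real list would come from corpus statistics, not from guessing.
RARE_BIGRAMS = {"fv", "vf", "jq", "qj", "zx", "xj", "qz", "jx", "vq", "kq"}

def rare_bigram_ratio(text):
    """Return the fraction of within-word letter pairs found in RARE_BIGRAMS."""
    total = 0
    rare = 0
    for word in re.findall(r"[a-z]+", text.lower()):
        # zip(word, word[1:]) walks over each adjacent pair of letters.
        for first, second in zip(word, word[1:]):
            total += 1
            if first + second in RARE_BIGRAMS:
                rare += 1
    return rare / total if total else 0.0

if __name__ == "__main__":
    print(rare_bigram_ratio("ibkcd ngf dfvfjq"))
    print(rare_bigram_ratio("dielectric accuse headlight"))
```

A message would then score only if the ratio passed some threshold, which
would have to be tuned against real ham and spam.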

Presumably a Bayesian approach would pick these up automatically, but a
targeted rule would probably be more efficient, perhaps even drawing on
ideas about phonemes and how they fit together (which I'm not familiar
with).

Further, there are spams which include a random sequence of real
dictionary words, for example:

dielectric accuse headlight berry onrushing rutherford congeal hepburn
quadrature oat exegete desolate cushion pancake

This seems to be 'heavy' with nouns and adjectives, with hardly any verbs
('accuse', and arguably 'congeal'). It might be possible to identify
ungrammatical sentences; of course, that would take a lot of CPU, plus a
dictionary that records parts of speech.
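
As a very rough sketch of what a part-of-speech tally might look like (in
Python, not actual grammar checking): the TOY_TAGS table below is entirely
my own assumption, with one tag per word chosen by hand. A real check would
need a full tagged dictionary and would have to cope with words that take
several parts of speech.

```python
from collections import Counter

# Hypothetical mini-dictionary, tagged by hand for illustration only.
TOY_TAGS = {
    "dielectric": "noun", "accuse": "verb", "headlight": "noun",
    "berry": "noun", "onrushing": "adjective", "rutherford": "noun",
    "congeal": "verb", "hepburn": "noun", "quadrature": "noun",
    "oat": "noun", "exegete": "noun", "desolate": "adjective",
    "cushion": "noun", "pancake": "noun",
    "the": "function", "on": "function", "cat": "noun",
    "sat": "verb", "mat": "noun",
}

def tag_shares(text):
    """Return each tag's share of the words in text (unknown words included)."""
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(TOY_TAGS.get(word, "unknown") for word in words)
    return {tag: count / len(words) for tag, count in counts.items()}

if __name__ == "__main__":
    salad = ("dielectric accuse headlight berry onrushing rutherford congeal "
             "hepburn quadrature oat exegete desolate cushion pancake")
    print(tag_shares(salad))
    print(tag_shares("the cat sat on the mat"))
```

With these tags the word salad has no function words at all, while the
ordinary sentence is half function words, which may be a cheaper signal than
trying to parse anything.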

You could perhaps assess semantic distance, too. For example, it makes
sense to 'cut paper' but not to 'cut envy', and there would be a large
semantic distance between 'headlight' and 'berry' above. Again, this would
cost a lot of CPU.
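
A crude stand-in for semantic distance, sketched in Python: treat two
adjacent words as 'related' only if they have been seen together in
legitimate text. The SEEN_TOGETHER set is a made-up placeholder of mine; in
practice it would have to be built from a large corpus, and a proper
distance measure would be more graded than this yes/no test.

```python
# Hypothetical word pairs assumed to co-occur in legitimate text.
# frozenset makes each pair order-independent.
SEEN_TOGETHER = {
    frozenset(pair)
    for pair in [("cut", "paper"), ("cut", "grass"), ("feel", "envy")]
}

def unrelated_pair_ratio(text):
    """Return the fraction of adjacent word pairs never seen together."""
    words = text.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    unrelated = sum(
        1 for first, second in pairs
        if frozenset((first, second)) not in SEEN_TOGETHER
    )
    return unrelated / len(pairs)

if __name__ == "__main__":
    print(unrelated_pair_ratio("cut paper"))
    print(unrelated_pair_ratio("cut envy"))
    print(unrelated_pair_ratio("headlight berry"))
```

A set lookup per word pair is cheap; the expensive part would be building
and storing a pair table large enough to be useful.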

I'm not on the list, so please reply to me directly as well.

I searched the archives for 'phonemes' and found only discussions of
phonetic substitutions in words that were meant to be read. I'm talking
here about identifying noise, and I couldn't find any postings of that
kind.

-- 
Democracy is a bitch if you're in the 49%.

John August



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
