Giampaolo Tomassoni wrote:
Hello everybody!

I'm going to propose another great idea which will probably radically change spam-detection techniques. No, come on: I'm just kidding. :)

I think this "idea" could help in better detecting the kind of spam in which some words are "garbled" in order to evade detection.

Some of you probably already know that there exist algorithms devoted to detecting the language in which a text is written. I just discovered the paper at http://www.sfs.uni-tuebingen.de/iscl/Theses/kranig.pdf , which by the way says that such detectors are already available as Perl modules on CPAN (see chapter 7).

The idea is that, by applying these algorithms to the text of a message, one could obtain the probability that the text is written in a given language. Let's say that a text is written in English: then these Perl routines should yield a high probability that the text is English.

Now, say that some of the words in that text are somehow "scrambled". The language detectors would probably decrease the probability that the text is in English but, assuming the words are randomly scrambled, the probability that the text is in another language wouldn't increase either.

So we could apply some thresholding to the language scores: when the score of the most probable language is less than a given threshold above the mean of the language scores, we could say that the message contains some "scrambled words" and apply a penalty score to it.

I know there are scores for scrambled versions of words like "cialis", but this method would be more robust with respect to non-English languages: I'm from Italy, and I'm used to seeing some FPs on Italian words like "via galileo" being taken as a scrambled version of "viagra". Also, attempting to collect all the variants of spam words is expensive in terms of effort.

Please note that:
- language detection doesn't (currently) work for ideographic languages (Chinese, Japanese, Korean and such);
- I haven't even run the language detection modules;
- a message written in many (> 3, 4?) languages would probably trigger the penalty score.

I'm just trying to see if this idea seems definitely "broken" to you, and whether anybody has already tried something like this.

Regards,
Giampaolo

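The thresholding step described in the message can be sketched roughly as follows. This is only an illustration in Python, not the Perl modules mentioned in the paper: the function name `looks_scrambled`, the `margin` value and the shape of the `scores` input (one comparable score per candidate language, as some language detector might return) are all assumptions made for the example.

```python
from statistics import mean


def looks_scrambled(scores, margin=0.3):
    """Return True when no language stands out clearly from the rest.

    scores: dict mapping a language code to the detector's score for it
            (assumed to be comparable values, e.g. probabilities).
    margin: hypothetical tuning value: how far above the mean score the
            best language must be for the text to count as "clean".
    """
    if not scores:
        return False  # no scores at all: nothing to judge
    best = max(scores.values())
    # Flag the text when the best score is less than `margin` above the mean.
    return best < mean(scores.values()) + margin


# Clean English text: one language clearly dominates, so no flag.
print(looks_scrambled({"en": 0.9, "it": 0.05, "de": 0.05}))
# Garbled text: every language scores about the same, so it is flagged.
print(looks_scrambled({"en": 0.4, "it": 0.3, "de": 0.3}))
```

Note that with only one candidate language the best score equals the mean, so the check is only meaningful when several languages are scored, and the margin would have to be tuned on real mail.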
Sounds interesting to me. You would want to apply the test only if there were a minimum amount of text. You could create a language called "spam" which uses misspelled versions of viagra, cialis, and other words spammers deliberately misspell. Might be worth looking into.
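As a rough illustration of these two suggestions, the sketch below skips messages with too little text and penalizes a message only when a pseudo-language named "spam" gets the best score. Everything here is hypothetical: the function name `spam_language_penalty`, the `min_letters` and `penalty` defaults, and the assumption that a detector has already been trained with a "spam" profile and returns one score per language.

```python
def spam_language_penalty(text, scores, min_letters=200, penalty=1.0):
    """Return a penalty when the text is best matched by the "spam" language.

    text:        the message body.
    scores:      dict mapping a language name to the detector's score;
                 assumed to include a pseudo-language called "spam".
    min_letters: hypothetical minimum number of alphabetic characters
                 needed before the test is applied at all.
    penalty:     hypothetical score to add when the test fires.
    """
    letters = sum(ch.isalpha() for ch in text)
    if letters < min_letters or not scores:
        return 0.0  # too little text (or no scores) to say anything
    best_language = max(scores, key=scores.get)
    return penalty if best_language == "spam" else 0.0


body = "v1agra c1alis " * 50  # 500 letters, enough to run the test
print(spam_language_penalty(body, {"spam": 0.7, "en": 0.2, "it": 0.1}))
print(spam_language_penalty("too short", {"spam": 0.7, "en": 0.3}))
```

Counting letters rather than raw characters keeps punctuation-heavy or mostly numeric messages from passing the minimum-length gate.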