Giampaolo Tomassoni wrote:
Hello everybody!

I'm going to propose to you another great idea which will probably radically
change spam-detection techniques.
        
No, come on: I'm just kidding. :) I think this "idea" could eventually help
in better detecting the kind of spam in which some words are "garbled" in
order to evade detection.

Some of you probably already know that there exist algorithms devoted to
detecting the language in which a text is written. I just discovered the
paper at http://www.sfs.uni-tuebingen.de/iscl/Theses/kranig.pdf , which by
the way says that such detectors are already available as Perl modules on
CPAN (see chapter 7).

The idea is that, by applying these algorithms to the text of a message, one
could obtain the probability that the text is written in a given language.
Say a text is written in English: these Perl routines should then yield a
high probability that the text is English. Now, say that some of the words
in that text are somehow "scrambled". The language detectors would probably
lower the probability that the text is in English but, assuming the words
are randomly scrambled, the probability that the text is in another language
wouldn't increase either. So we could apply a threshold to the language
scores: when the score of the most probable language fails to exceed the
mean of all the language scores by a given margin, we could say that the
message contains some "scrambled words" and apply a penalty score to it.
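
To make the thresholding step concrete, here is a minimal sketch in Python.
It assumes the language detector has already produced one score per
language; the dictionary layout, the function name looks_scrambled and the
margin value 0.2 are all hypothetical and not taken from any CPAN module.

```python
from statistics import mean

def looks_scrambled(scores, margin=0.2):
    """Return True when the best language score is not clearly above the mean.

    scores: dict mapping a language code to its detector score (assumed layout).
    margin: how far the best score must exceed the mean (assumed value).
    """
    if len(scores) < 2:
        raise ValueError("need scores for at least two languages")
    best = max(scores.values())
    avg = mean(scores.values())
    return (best - avg) < margin

# Made-up scores, for illustration only.
clean = {"en": 0.90, "it": 0.05, "de": 0.05}    # one language clearly wins
garbled = {"en": 0.40, "it": 0.30, "de": 0.30}  # no language stands out

print(looks_scrambled(clean))    # False
print(looks_scrambled(garbled))  # True
```

A message for which looks_scrambled returns True would get the penalty
score; the margin would have to be tuned against real mail.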

I know there are scores for scrambled versions of words like "cialis", but
this method would be more robust with respect to non-English languages: I'm
from Italy, and I'm used to seeing FPs on Italian phrases like "via galileo"
being taken as a scrambled version of "viagra". Also, attempting to collect
all the variants of spam words is expensive in terms of effort.

Please note that:

 - language detection doesn't (currently) work for ideographic languages
(Chinese, Japanese, Korean and such);

 - I haven't even tried running the language detection modules;

 - a message written in many (> 3, 4?) languages would probably trigger the
penalty score.

I'm just trying to see whether such an idea seems definitely "broken" to you,
and whether anybody has already tried this.

Regards,

Giampaolo


Sounds interesting to me. You would want to apply the test only if there were a minimum amount of text. You could create a language called "spam" that uses the misspelled versions of viagra, cialis, and other words spammers deliberately misspell. Might be worth looking into.
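
A rough sketch of what a "spam" language could look like, assuming character trigram profiles compared by cosine similarity (a common approach to language identification, not necessarily what the CPAN modules do). The names MIN_CHARS, trigrams, similarity and spam_language_score are hypothetical, and the tiny word list only stands in for a real training corpus.

```python
from collections import Counter

MIN_CHARS = 200  # assumed minimum amount of text before the test applies

def trigrams(text):
    """Count character trigrams of the lowercased, whitespace-normalized text."""
    s = " " + " ".join(text.lower().split()) + " "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def similarity(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def spam_language_score(text, spam_profile, min_chars=MIN_CHARS):
    """Return the similarity to the "spam" profile, or None if the text is too short."""
    if len(text) < min_chars:
        return None
    return similarity(trigrams(text), spam_profile)

# Stand-in training data: a handful of deliberately misspelled words.
spam_profile = trigrams("v1agra vi4gra c1alis cia1is")
```

The "spam" profile would then be scored alongside the real languages, and a message whose best match is "spam" could be penalized.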
