Re: double letter porn

NFN Smith Wed, 18 Oct 2006 08:06:00 -0700

Richard Doyle wrote:

I've been getting lots of porn site spam containing words with doubled
letters, like this one:

I was looking at this one yesterday, and thought of a different approach. It may be a little kludgy, but it seems to work in some basic tests.

For this, I'm starting with a list of words that are commonly misspelled with double characters.

I start with a rule that looks for these words, with correct spelling, and scores a hit with 0.01 points. Call this the strict rule.

I then write a second rule that looks for the same words, but with regexp wildcarding that looks for the pattern characters in the word and still scores a positive if there's other stuff there: either b*a*d*w*o*r*d or b.?a.?d.?w.?o.?r.?d. A hit on this rule generates a very high score, say 100 points. Call this the loose rule.
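To make the two kinds of pattern concrete, here is a rough sketch in Python rather than in actual rule syntax. The word "badword" is only a placeholder, and the variable names are made up for illustration. One assumption to flag: I use + (one or more) instead of * for the doubled-letter form, because in a real regexp b* would also match zero b's and the pattern could match almost anything.

```python
import re

WORD = "badword"  # placeholder, not taken from any real word list

# Strict rule: the word exactly as correctly spelled.
strict = re.compile(r"\b" + re.escape(WORD) + r"\b", re.IGNORECASE)

# Loose rule, doubled-letter form: every letter one or more times (b+a+d+...).
loose_doubled = re.compile(
    r"\b" + "".join(re.escape(c) + "+" for c in WORD) + r"\b",
    re.IGNORECASE,
)

# Loose rule, filler form: at most one arbitrary character between
# consecutive letters (b.?a.?d.?...).
loose_filler = re.compile(
    r"\b" + ".?".join(re.escape(c) for c in WORD) + r"\b",
    re.IGNORECASE,
)
```

Note that both loose forms also match the correctly spelled word, which is why the strict rule is needed to tell the two cases apart.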

Finally, I create a meta rule that includes both the strict rule and the loose rule. If I get a hit there (that is, where I have hits on both the other rules), it means that the word is correctly spelled, and the hit on the meta rule generates a negative value of whatever score was applied to the loose rule.

If only the loose rule is hit, then the word is misspelled (presumably deliberately), and the high score is retained.
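The way the three scores combine can be sketched like this, again in Python and only as an illustration. The constants mirror the example values above (0.01 and 100), the function name score is hypothetical, and "badword" is a placeholder; the real work would be done by the filter's own rule engine.

```python
import re

STRICT_SCORE = 0.01       # correctly spelled word
LOOSE_SCORE = 100.0       # any spelling the loose pattern accepts
META_SCORE = -LOOSE_SCORE # cancels the loose score when both rules hit

strict = re.compile(r"\bbadword\b", re.IGNORECASE)
loose = re.compile(r"\bb+a+d+w+o+r+d+\b", re.IGNORECASE)

def score(text):
    """Return the combined score of the strict, loose, and meta rules."""
    strict_hit = strict.search(text) is not None
    loose_hit = loose.search(text) is not None
    total = 0.0
    if strict_hit:
        total += STRICT_SCORE
    if loose_hit:
        total += LOOSE_SCORE
    if strict_hit and loose_hit:
        # Meta rule: both hit, so the word is spelled correctly.
        total += META_SCORE
    return total
```

With these values, a message containing the correctly spelled word nets roughly 0.01 points, while one containing only a doubled-letter variant keeps the full 100.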

I haven't yet tested how this approach works on messages that may have multiple words that are deliberately misspelled, but with just a single word and basic testing, I'm pleased with the initial results. In particular, this seems to allow me to accept words that are often legitimate when correctly spelled, but have high probability of spam (and likely offensive) if misspelled.

Can anybody who has more experience in this area tell me of potential problems with this approach?


Smith
