Richard Doyle wrote:
I've been getting lots of porn site spam containing words with doubled
letters, like this one:


I was looking at this one yesterday, and thought of a different approach. It may be a little kludgy, but it seems to work on some basic tests.

For this, I'm starting with a list of words that are commonly misspelled with double characters.

I start with a rule that looks for these words, correctly spelled, and scores a hit at 0.01 points. Call this the strict rule.

I then add a second rule that looks for the same words, but with regexp wildcarding: it matches the characters of the word in order, and still counts as a hit if there's other stuff mixed in -- either b*a*d*w*o*r*d or b.?a.?d.?w.?o.?r.?d. A hit on this rule generates a very high score, say 100 points. Call this the loose rule.
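To make the two loose patterns concrete, here is a rough Python sketch (not rule syntax for any particular filter). The helper names and the placeholder word are made up for illustration. One assumption to flag: read literally as a regexp, b*a*d*w*o*r*d also matches the empty string, so the sketch uses + (one or more of each letter) for the doubled-letter form.

```python
import re

def loose_pattern(word):
    # b.?a.?d.?w.?o.?r.?d : at most one arbitrary character between letters.
    body = ".?".join(re.escape(c) for c in word)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

def doubled_pattern(word):
    # b+a+d+w+o+r+d+ : each letter may repeat (assumed reading of b*a*d*...).
    body = "".join(re.escape(c) + "+" for c in word)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

loose = loose_pattern("badword")      # "badword" is a placeholder
doubled = doubled_pattern("badword")

for sample in ["badword", "baadword", "b-a-d-w-o-r-d", "baddwoord", "bdword"]:
    print(sample, bool(loose.search(sample)), bool(doubled.search(sample)))
```

Note that both patterns also match the correctly spelled word, which is what the next step relies on.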

Finally, I create a meta rule that includes both the strict rule and the loose rule. If I get a hit there (that is, where I have hits on both the other rules), it means that the word is correctly spelled, and the meta rule generates a negative score equal to whatever was applied to the loose rule, cancelling it out.

If only the loose rule is hit, then the word is misspelled (presumably deliberately), and the high score is retained.
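The arithmetic of the three rules can be sketched in Python as below. This only simulates the scoring; it is not actual rule syntax, and the word, the function name, and the constants are placeholders taken from the example scores above.

```python
import re

WORD = "badword"        # placeholder word
STRICT_SCORE = 0.01     # strict rule: correct spelling
LOOSE_SCORE = 100.0     # loose rule: letters in order, junk allowed

strict_re = re.compile(r"\b" + WORD + r"\b", re.IGNORECASE)
loose_re = re.compile(r"\b" + ".?".join(WORD) + r"\b", re.IGNORECASE)

def score(text):
    strict_hit = bool(strict_re.search(text))
    loose_hit = bool(loose_re.search(text))
    total = 0.0
    if strict_hit:
        total += STRICT_SCORE
    if loose_hit:
        total += LOOSE_SCORE
    if strict_hit and loose_hit:
        # Meta rule: both hit, so the spelling was correct; cancel the loose score.
        total -= LOOSE_SCORE
    return total

print(score("this has badword in it"))   # roughly 0.01, loose score cancelled
print(score("this has baadword in it"))  # loose score retained
```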

I haven't yet tested how this approach works on messages that may have multiple words that are deliberately misspelled, but with just a single word and basic testing, I'm pleased with the initial results. In particular, this seems to allow me to accept words that are often legitimate when correctly spelled, but have a high probability of being spam (and likely offensive) if misspelled.
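One multi-word case that may be worth checking, sketched in Python rather than real rule syntax: if the whole word list is folded into one strict rule and one loose rule (an assumption, as are the placeholder words and function names), a correctly spelled word elsewhere in the message could trigger the meta rule and cancel the score for a misspelled one. Keeping a strict/loose/meta triple per word would avoid that.

```python
import re

WORDS = ["badword", "otherword"]   # placeholder word list
STRICT_SCORE = 0.01
LOOSE_SCORE = 100.0

def build_rules(words):
    strict = re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)
    loose_alts = "|".join(".?".join(w) for w in words)
    loose = re.compile(r"\b(?:" + loose_alts + r")\b", re.IGNORECASE)
    return strict, loose

def score(text, strict, loose):
    strict_hit = bool(strict.search(text))
    loose_hit = bool(loose.search(text))
    total = 0.0
    if strict_hit:
        total += STRICT_SCORE
    if loose_hit:
        total += LOOSE_SCORE
    if strict_hit and loose_hit:
        total -= LOOSE_SCORE   # meta rule cancels the loose score
    return total

def combined_score(text):
    # One strict rule and one loose rule covering every word.
    return score(text, *build_rules(WORDS))

def per_word_score(text):
    # A separate strict/loose/meta triple for each word.
    return sum(score(text, *build_rules([w])) for w in WORDS)

msg = "badword and otherrword"     # one correct spelling, one doubled letter
print(combined_score(msg))   # small: the misspelling is cancelled
print(per_word_score(msg))   # large: the misspelling still scores
```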

Can anybody who has more experience in this area tell me of potential problems with this approach?

Smith


