Karl Auer wrote:
On Tue, 2006-11-14 at 09:58 -0500, Peter H. Lemieux wrote:
< body  __HAS_PENETRATION                       /\bpenetration\b/i

I think a lot of rules would be better for losing the word boundaries.
Very few of the worst "four-letter words" are ever legitimate
substrings, either.

I generally agree, Karl. In this particular instance I was suggesting a patch to the 70_sare_adult ruleset and was following the patterns the maintainer used for similar rules.

OTOH, I've had FP problems with simple word searches that don't include word boundaries. A word like "sex" can match "sextuplets" or "Middlesex". (The latter case brought this quickly to my attention some years ago when I first started writing my own SA rules. Middlesex is a county here in Massachusetts.) It's often hard to imagine all the possible false positives that might arise from a particular string, so I can understand why publicly distributed rulesets like those from SARE are so careful about word boundaries.
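
To make the difference concrete, here is a minimal sketch in Python rather than in SA rule syntax. For simple ASCII cases like these, Python's `re` module handles `\b` and case-insensitive matching the same way the Perl regexes in SA rules do. The sample strings and the helper name `matches` are made up for illustration only.

```python
import re

# The same pattern with and without word boundaries.
# re.IGNORECASE plays the role of the /i flag in an SA rule.
bounded = re.compile(r"\bsex\b", re.IGNORECASE)
unbounded = re.compile(r"sex", re.IGNORECASE)

# Hypothetical sample text, not taken from any real mail corpus.
samples = [
    "Middlesex County tax bill",
    "Congratulations on the sextuplets",
    "Hot sex now",
]


def matches(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return pattern.search(text) is not None


for text in samples:
    print(f"{text!r}: bounded={matches(bounded, text)} "
          f"unbounded={matches(unbounded, text)}")
```

The unbounded pattern hits all three samples, while the bounded one hits only the last, which is the kind of FP the word boundaries are there to prevent.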

Peter
