On Thu, 2019-08-29 at 11:10 -0700, John Hardin wrote:
> On Thu, 29 Aug 2019, Matus UHLAR - fantomas wrote:
>
> > > On Wed, 28 Aug 2019, Samy Ascha wrote:
> > > > Today, I encountered, for the first time, an issue with scanning
> > > > an email that is composed in Spanish.
> > > >
> > > > It is hitting a fuzzy match somewhere in the DRUGS_ERECTILE and
> > > > DRUGS_ERECTILE_OBFU rules matches.
> > > >
> > > > I'm generally looking for a way to manipulate these edge cases,
> > > > where languages are likely to match rules assuming English for
> > > > the body text.
> > > >
> > > > Is there any best-practice for this? I'm sure this happens in
> > > > others' networks, but I'm totally unsure on how to best resolve
> > > > this.
> > > >
> > > > Anything in the way of configuration to combat this, e.g. by
> > > > combining language detection with other tags?
> > > >
> > > > Or, should I look into writing my own plugin to do something
> > > > similar?
> >
> > On 28.08.19 07:48, John Hardin wrote:
> > > Generally the approach is to add an exclusion for the specific
> > > valid non-english word to the rule itself.
> >
> > imho the best approach would be excluding hitting exact word for
> > valid language, e.g. FUZZY_CREDIT shouldn't hit work "kredit" for
> > languages where it's written this way
>
> Exactly.
>
> > but that needs deeper logic...
>
> And a familiarity with potentially many languages...
>
For detecting spam of this type (pushing unwanted products including
financial stuff, cosmetics, ....) I get good results from a slightly
more complex type of rule, rather like this:

describe FINANCIAL_SPAM Unwanted finance offers
body     __FS1 /(cheap|low interest|....)/
body     __FS2 /(credit|loan|mortgage|...)/
meta     FINANCIAL_SPAM (__FS1 && __FS2)
score    FINANCIAL_SPAM ....

This can be scored quite high because it only triggers if both subrules
match. With carefully chosen lists of come-on phrases and product names
it doesn't generate many false positives, simply because the combination
is a specific spam marker while any of the terms by themselves are not.
Better yet, this type of rule can validly hit on combinations of come-on
phrase and product name you hadn't seen when you set the rule up.

Once loaded, the overhead of using even rather long lists of alternates
in the subrules is low. The main disadvantage is that any list that's
more than 10 items or so becomes a pain to edit, because SA requires the
entire regex to be on a single line. So I wrote a simple script (using
only bash and awk) that generates validly constructed rules from test
files that are easy to edit by design.

If you're interested, you can download the script and documentation
from here:

http://www.libelle-systems.c3487738.myzen.co.uk/free/portmanteau/portmanteau.tgz

Martin
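
To illustrate the general idea of generating the single-line subrules from lists that are easy to edit, here is a minimal sketch in Python. It is not the portmanteau script mentioned above (that one uses bash and awk, and its input format is not shown here). The function names, the one-term-per-line list convention with '#' comments, and the score value 2.5 are all assumptions made for the example.

```python
# Hypothetical helper, not part of SpamAssassin or portmanteau.
# Characters backslash-escaped so each term is matched literally and the
# '/' regex delimiter is not closed early.
_META = set(r"\.^$*+?()[]{}|/")


def escape(term):
    """Backslash-escape regex metacharacters and '/' in one term."""
    return "".join("\\" + c if c in _META else c for c in term)


def build_subrule(name, terms):
    """Return one 'body' subrule line matching any of the given terms."""
    # Assumed list convention: blank entries and '#' comments are skipped.
    cleaned = [t.strip() for t in terms]
    cleaned = [t for t in cleaned if t and not t.startswith("#")]
    if not cleaned:
        raise ValueError("no terms given for " + name)
    alternates = "|".join(escape(t) for t in cleaned)
    return f"body {name} /({alternates})/"


def build_rule(name, description, prefix, come_ons, products, score):
    """Return the meta rule text: two subrules that must both match."""
    return "\n".join([
        f"describe {name} {description}",
        build_subrule(f"{prefix}1", come_ons),
        build_subrule(f"{prefix}2", products),
        f"meta {name} ({prefix}1 && {prefix}2)",
        f"score {name} {score}",
    ])


if __name__ == "__main__":
    # The score 2.5 is an arbitrary placeholder, not a recommendation.
    print(build_rule(
        "FINANCIAL_SPAM", "Unwanted finance offers", "__FS",
        ["cheap", "low interest"],
        ["credit", "loan", "mortgage"],
        2.5,
    ))
```

In practice each term list would be read from its own text file, one term per line, so that adding or removing a phrase never means editing a long regex by hand.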