Re: Spanish language i.c.w. DRUGS_ERECTILE et al.

Martin Gregorie Thu, 29 Aug 2019 11:58:18 -0700

On Thu, 2019-08-29 at 11:10 -0700, John Hardin wrote:
> On Thu, 29 Aug 2019, Matus UHLAR - fantomas wrote:
> 
> > > On Wed, 28 Aug 2019, Samy Ascha wrote:
> > > > Today, I encountered, for the first time, an issue with scanning
> > > > an email 
> > > > that is composed in Spanish.
> > > > 
> > > > It is hitting a fuzzy match somewhere in the DRUGS_ERECTILE and 
> > > > DRUGS_ERECTILE_OBFU rules matches.
> > > > 
> > > > I'm generally looking for a way to manipulate these edge cases,
> > > > where 
> > > > languages are likely to match rules assuming English for the
> > > > body text.
> > > > 
> > > > Is there any best-practice for this? I'm sure this happens in
> > > > others' 
> > > > networks, but I'm totally unsure on how to best resolve this.
> > > > 
> > > > Anything in the way of configuration to combat this, e.g. by
> > > > combining 
> > > > language detection with other tags?
> > > > 
> > > > Or, should I look into writing my own plugin to do something
> > > > similar?
> > 
> > On 28.08.19 07:48, John Hardin wrote:
> > > Generally the approach is to add an exclusion for the specific
> > > valid 
> > > non-english word to the rule itself.
> > 
> > imho the best approach would be to exclude exact-word hits for a
> > valid language, e.g. FUZZY_CREDIT shouldn't hit the word "kredit"
> > in languages where it's written this way
> 
> Exactly.
> 
> > but that needs deeper logic...
> 
> And a familiarity with potentially many languages...
> 
For detecting spam of this type (pushing unwanted products including
financial stuff, cosmetics, ....) I get good results from a slightly
more complex type of rule, rather like this:


describe  FINANCIAL_SPAM  Unwanted finance offers
body      __FS1           /(cheap|low interest|....)/
body      __FS2           /(credit|loan|mortgage|...)/
meta      FINANCIAL_SPAM  (__FS1 && __FS2)
score     FINANCIAL_SPAM  ....

This can be scored quite high because it only triggers if both subrules
match. With carefully chosen lists of come-on phrases and product names
it doesn't generate many false positives, simply because the
combination is a specific spam marker while any of the terms by
themselves are not. Better yet, this type of rule can validly hit on
combinations of come-on phrase and product name you hadn't seen when you
set the rule up. Once loaded, the overhead of using even rather long
lists of alternates in the subrules is low.
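
As a rough illustration of why the combination is more specific than
either list alone, the AND logic of the meta rule can be mimicked
outside SA. This is only a sketch: the word lists are hypothetical, the
word boundaries and case-insensitive matching are assumptions added
here (the rule above does not show them), and Python's re module is not
identical to the Perl regexes SA uses.

```python
import re

# Hypothetical word lists, chosen only for illustration.
COME_ONS = ["cheap", "low interest"]
PRODUCTS = ["credit", "loan", "mortgage"]


def alternation(terms):
    """Compile a word-bounded, case-insensitive alternation of the terms."""
    body = "|".join(re.escape(t) for t in terms)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


FS1 = alternation(COME_ONS)   # stands in for subrule __FS1
FS2 = alternation(PRODUCTS)   # stands in for subrule __FS2


def financial_spam(text):
    """Mimic 'meta FINANCIAL_SPAM (__FS1 && __FS2)': both must match."""
    return bool(FS1.search(text)) and bool(FS2.search(text))
```

With these lists, a message mentioning only "mortgage" or only "cheap"
does not trigger, while one containing both a come-on phrase and a
product name does, including pairings that were never seen together
when the lists were written.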

The main disadvantage is that any list that's more than 10 items or so
becomes a pain to edit because SA requires the entire regex to be on a
single line, so I wrote a simple script (using only bash and awk) that
generates validly constructed rules from text files that are easy to
edit by design. If you're interested, you can download the script and
documentation from here:
http://www.libelle-systems.c3487738.myzen.co.uk/free/portmanteau/portmanteau.tgz
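
For readers who just want the general idea, here is a minimal sketch of
such a generator. It is not the script linked above and does not follow
its input format: the one-term-per-line layout, the function names, the
subrule naming scheme and the added \b and /i are all assumptions made
for illustration.

```python
import re


def read_terms(lines):
    """Return stripped terms, skipping blank lines and '#' comment lines."""
    terms = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            terms.append(line)
    return terms


def sa_escape(term):
    """Backslash-escape regex metacharacters, the '/' delimiter and '#'."""
    return re.sub(r"([\\/.^$*+?()\[\]{}|#])", r"\\\1", term)


def build_rules(name, describe, score, come_ons, products):
    """Build a two-subrule meta rule, each regex on a single line."""
    sub1 = "|".join(sa_escape(t) for t in come_ons)
    sub2 = "|".join(sa_escape(t) for t in products)
    prefix = "__" + name  # hypothetical naming scheme for the subrules
    return "\n".join([
        f"describe  {name}  {describe}",
        f"body      {prefix}_1  /\\b(?:{sub1})\\b/i",
        f"body      {prefix}_2  /\\b(?:{sub2})\\b/i",
        f"meta      {name}  ({prefix}_1 && {prefix}_2)",
        f"score     {name}  {score}",
    ])
```

Each list can then live in its own text file with one term per line,
for example build_rules("FINANCIAL_SPAM", "Unwanted finance offers",
score, read_terms(open("come_ons.txt")), read_terms(open("products.txt")))
where the file names and the score are whatever suits your setup. The
output should still be checked with spamassassin --lint before use.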


Martin
