On 20 Mar 2019, at 9:04, piecka wrote:
Hello
We've encountered a high false positive rate with the MIXED_ES rule for
emails written in Czech. Czech naturally uses all of e, ě and é.
The situation is similar for Slovak, which includes e and é.
It seems the same with Greek
(https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7691).
Email messages written in any of the languages mentioned above (and
probably others) have a much higher false positive rate than I would
consider acceptable.
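As a rough illustration of the point about Czech, the sketch below
collects the distinct variants of the letter "e" that occur in a piece of
text. It does not reproduce the actual MIXED_ES rule, whose definition is
not shown in this thread; the function name e_variants and the sample
words are assumptions made for the example.

```python
import unicodedata

def e_variants(text):
    """Return the distinct lowercase letters in text whose base letter is 'e'."""
    found = set()
    # Compose first so that each accented letter is a single character.
    for ch in unicodedata.normalize("NFC", text).lower():
        # NFD splits an accented letter into its base letter plus combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed[0] == "e":
            found.add(ch)
    return found

# Two ordinary Czech words already contain all three variants.
print(sorted(e_variants("Děkujeme, nejlépe")))
```

Any check that treats a mix of "e" variants as suspicious would therefore
see ordinary Czech text as a candidate match.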
I apologize for this: I am the instigator of MIXED_ES, which has done a
good job of catching the extortion spam it was designed for and has an
additional benefit of targeting a generic tactic rather than the moving
target of phrasing. I would very much like to minimize how often it
matches on ham.
Unfortunately, I don't have any examples of FPs, only reports of them.
This makes targeted mitigation very difficult. The Rule QA system has
masscheck reports of a steady but small number of hits on ham, almost
all from a single smallish corpus and no more than one message in any
recent masscheck actually scoring as spam overall.
I've added these lines to the block that defines MIXED_ES, which may help
some sites:
lang pl score MIXED_ES 0.01
lang cz score MIXED_ES 0.01
lang sk score MIXED_ES 0.01
lang hr score MIXED_ES 0.01
lang el score MIXED_ES 0.01
Those should get into the default rules channel within a few days.
Additionally, the default score for the rule is 3.999, which is quite
high.
The current score quartet (as determined by the Rule QA system) is
'2.791 2.699 2.791 2.699' and the last time any of those scores was
3.999 was 3 March. If your system is scoring it at 3.999, you should be
running sa-update more often.
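For readers unfamiliar with score quartets: SpamAssassin's score setting
accepts four values, used respectively when Bayes and network tests are
both off, when only network tests are on, when only Bayes is on, and when
both are on. A minimal sketch of that selection, with pick_score as a
hypothetical helper name rather than anything in SpamAssassin itself:

```python
def pick_score(quartet, bayes_enabled, net_enabled):
    """Return the score from a four-value quartet for the active configuration."""
    # Score set index: 0 = no Bayes, no network tests; 1 = network tests only;
    # 2 = Bayes only; 3 = Bayes and network tests.
    index = (2 if bayes_enabled else 0) + (1 if net_enabled else 0)
    return quartet[index]

# The quartet quoted above for MIXED_ES.
MIXED_ES_QUARTET = (2.791, 2.699, 2.791, 2.699)

# A site running both Bayes and network tests uses the fourth value.
print(pick_score(MIXED_ES_QUARTET, bayes_enabled=True, net_enabled=True))
```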
Also, I think it should be understood that nearly all SA rules with a
positive score will match some 'ham' messages. These are "false
positives" for the individual rule, but usually they are NOT false
positives for SpamAssassin as a whole.
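The distinction can be sketched as follows, assuming the default
required_score of 5.0. The function is_spam and the rule name
HYPOTHETICAL_OTHER_RULE are made up for the example; only the MIXED_ES
score comes from the quartet above.

```python
REQUIRED_SCORE = 5.0  # SpamAssassin's default required_score

def is_spam(hits, required=REQUIRED_SCORE):
    """Classify a message from the scores of the rules it hit."""
    # A message is tagged as spam when its summed score reaches the threshold.
    return sum(hits.values()) >= required

# A ham message that hits only MIXED_ES stays below the threshold.
print(is_spam({"MIXED_ES": 2.791}))

# It takes further hits (a hypothetical rule here) to cross it.
print(is_spam({"MIXED_ES": 2.791, "HYPOTHETICAL_OTHER_RULE": 2.5}))
```

In the first case the rule has a false positive, but SpamAssassin as a
whole does not.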