On 20 Mar 2019, at 9:04, piecka wrote:

Hello

We've encountered a high false positive rate with the MIXED_ES rule for emails written in Czech. Czech naturally uses all of e, ě and é.

The situation is similar for Slovak, which includes e and é.

It seems to be the same with Greek
(https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7691).

Email messages written in one of the above-mentioned (and probably other)
languages have a much higher false positive rate than I would consider
acceptable.

I apologize for this: I am the instigator of MIXED_ES, which has done a good job of catching the extortion spam it was designed for and has the additional benefit of targeting a generic tactic rather than the moving target of phrasing. I would very much like to minimize how often it matches on ham.

Unfortunately, I don't have any examples of FPs, only reports of them. This makes targeted mitigation very difficult. The Rule QA system has masscheck reports of a steady but small number of hits on ham, almost all from a single smallish corpus and no more than one message in any recent masscheck actually scoring as spam overall.

I've added these lines to the block that defines MIXED_ES, which may help some sites:

    lang pl  score MIXED_ES  0.01
    lang cs  score MIXED_ES  0.01
    lang sk  score MIXED_ES  0.01
    lang hr  score MIXED_ES  0.01
    lang el  score MIXED_ES  0.01

Those should get into the default rules channel within a few days.
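
Sites that cannot wait for the update, or whose SpamAssassin locale is not one of those listed, could lower the score locally in the meantime. A minimal sketch, assuming the site-wide configuration lives in local.cf (the path in the comment and the value 1.0 are illustrative assumptions, not recommendations):

    # local.cf (location varies; often /etc/mail/spamassassin/local.cf)
    # Local override: reduce the weight of MIXED_ES on this system only.
    # 1.0 is an arbitrary illustrative value; choose one that suits your mail.
    score MIXED_ES 1.0

Running "spamassassin --lint" after the change checks that the configuration still parses cleanly.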

Additionally, the default score for the rule is 3.999, which is quite high.

The current score quartet (as determined by the Rule QA system) is '2.791 2.699 2.791 2.699' and the last time any of those scores was 3.999 was 3 March. If your system is scoring it at 3.999, you should be running sa-update more often.

Also, I think it should be understood that nearly all SA rules with a positive score will match some 'ham' messages. These are "false positives" for the individual rule, but usually they are NOT false positives for SpamAssassin as a whole. With the default required_score of 5.0, a hit on MIXED_ES alone at its current score does not mark a message as spam.
