Marcin Mirosław wrote:
On 2017-03-14 16:23, Kris Deugau wrote:
If I read the information flow correctly, this is actually decided by
seek-phrases-in-log, which spits out subrules that reached a certain
hit rate in blocks, followed by the "# passed hit-rate threshold nnn"
line. mk_meta_rule_scores just takes that in, collects the rule names
in each block, and spits out the meta.


I ran some tests and watched the output to understand how some of the
parameters work. It seems that "--reqhitrate" works this way:
a) if --reqhitrate contains only one value, then the output of
seek-phrases-in-log contains only rules that hit more than the value passed
to --reqhitrate. So this cuts off rules that are hit rarely.

b) if --reqhitrate contains more than one value, then:
<high cut-off level equal to the highest value passed to --reqhitrate> rules
<second value> other rules <low cut-off>
example:
--reqhitrate "70 10 1" gives:
<100%> - no rules here - <70%> - rules that match less than 70% of
spam - <10%> rules that match less than 10% of spam and more than 1% -
<1%> - cut-off, no rules here

You got me curious about exactly what this means, and we're both right, just describing it differently.

For this usage, seek-phrases-in-log does roughly the following (for multiple values to --reqhitrate):

for each pattern
  determine the percentage of spam it hits
  discard the pattern if it's less than the lowest threshold
sort patterns by hit percentage, highest first
for each pattern
  if the hit rate has passed the next threshold
    print flag line "# passed hit-rate threshold <threshold>"
    advance to next threshold
  print a line "# 1.0 <percentage> 0"
  print the subrule
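
The same flow as a minimal Python sketch. This is only my reading of the steps above, not the actual seek-phrases-in-log code: the function name emit_subrules, the (name, regex, percentage) tuples, and the strict "less than" comparison at each threshold are all assumptions made for illustration. The flag line uses the colon form that appears in the sample output.

```python
def emit_subrules(patterns, thresholds):
    """Sketch of the described flow (hypothetical helper, not the real script).

    patterns   -- iterable of (subrule_name, regex_text, spam_hit_percentage)
    thresholds -- the --reqhitrate values, in any order
    """
    cuts = sorted(thresholds, reverse=True)   # highest threshold first
    lowest = cuts[-1]

    # Discard anything below the lowest threshold, then sort by hit rate.
    kept = [p for p in patterns if p[2] >= lowest]
    kept.sort(key=lambda p: p[2], reverse=True)

    lines = []
    nxt = 0  # index of the next threshold not yet passed
    for name, regex, pct in kept:
        # Assumption: "passed" means strictly below the threshold. A loop is
        # used so that a big drop in hit rate emits several flag lines in a
        # row, which is how an empty group can appear.
        while nxt < len(cuts) and pct < cuts[nxt]:
            lines.append(f"# passed hit-rate threshold: {cuts[nxt]:g}")
            nxt += 1
        lines.append(f"# 1.0 {pct:g} 0")
        lines.append(f"body {name} /{regex}/")
    return lines


if __name__ == "__main__":
    sample = [("__FOO1", "foo1", 60.0), ("__FOO2", "foo2", 30.0),
              ("__FOO3", "foo3", 0.05)]
    print("\n".join(emit_subrules(sample, [50, 0.1, 5, 20, 1])))
```

Note that the order of the values given to --reqhitrate does not matter in this sketch, since they are sorted before use.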

So if you call it with --reqhitrate "50 0.1 5 20 1" on the right set of spam, you might get an intermediate file containing:

body __FOO1 /foo1/
# passed hit-rate threshold: 50
body __FOO2 /foo2/
body __FOO3 /foo3/
# passed hit-rate threshold: 20
body __FOO4 /foo4/
body __FOO5 /foo5/
body __FOO6 /foo6/
# passed hit-rate threshold: 5
body __FOO7 /foo7/
# passed hit-rate threshold: 1
body __FOO8 /foo8/

(Plus the additional "# 1.0 <percentage> 0" comment lines that are just noise at this point.)

mk_meta_rule_scores just takes those groups of rules, separated by the "# passed hit-rate..." lines, and builds FOO_1, FOO_2, FOO_3 etc. meta rules - 5 of them in this case. mk_meta_rule_scores itself is a pretty simpleminded script; most of the heavy lifting is done in seek-phrases-in-log.

However, if the message data you feed in doesn't separate out to produce at least one rule in each range, seek-phrases-in-log and mk_meta_rule_scores between them will happily create an empty meta rule:

# passed hit-rate threshold: 50
body __FOO1 /foo1/
body __FOO2 /foo2/
body __FOO3 /foo3/
# passed hit-rate threshold: 20
body __FOO4 /foo4/
body __FOO5 /foo5/
body __FOO6 /foo6/
# passed hit-rate threshold: 5
# passed hit-rate threshold: 1
body __FOO7 /foo7/
body __FOO8 /foo8/

With the above output from seek-phrases-in-log, mk_meta_rule_scores will create "meta FOO_1 ()" and "meta FOO_4 ()", since there are no patterns in the first or fourth groups (100% to 50% and 5% to 1%, by your description). It also scores these empty metas at 0 for tidiness - after all, they'll never fire.
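
To make the grouping concrete, here is a small Python sketch of that step. It is an illustration only, not the real script: the function name build_metas, the "FOO" prefix argument, and joining the subrule names with "||" are assumptions. Only the empty "meta FOO_n ()" form and the 0 score for empty metas come from the behaviour described above; scores for the non-empty metas are left out because I have not described how they are chosen.

```python
def build_metas(lines, prefix="FOO"):
    """Group subrules by flag line and emit one meta per group (sketch)."""
    groups = [[]]  # the first group holds rules above the highest threshold
    for line in lines:
        if line.startswith("# passed hit-rate threshold"):
            groups.append([])          # a flag line starts the next group
        elif line.startswith("#") or not line.strip():
            continue                   # skip other comments and blank lines
        else:
            # "body __FOO1 /foo1/" -> keep the subrule name only
            groups[-1].append(line.split()[1])

    out = []
    for n, names in enumerate(groups, start=1):
        # Assumption: subrules in a group are OR-ed together.
        out.append(f"meta {prefix}_{n} ({' || '.join(names)})")
        if not names:
            out.append(f"score {prefix}_{n} 0")  # empty meta never fires
    return out


if __name__ == "__main__":
    sample = [
        "# passed hit-rate threshold: 50",
        "body __FOO1 /foo1/",
        "# passed hit-rate threshold: 5",
        "# passed hit-rate threshold: 1",
        "body __FOO2 /foo2/",
    ]
    print("\n".join(build_metas(sample)))
```

Two flag lines in a row, or a flag line before any subrule, is all it takes to get an empty group.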

In your case, to look at your original question, none of the derived patterns matched more than 70% of the spam set you fed in, so the _1 rule was empty.

Have you tried mk_meta_rule_scores, and did you get more than two score
values (the default and the value in the medium range)? I suspect
that mk_meta_rule_scores doesn't play well with ranges. It is something
I can live with, but if there is a bug somewhere I would try to report it.
Even if it is not fixed, that could save some time for other users trying to
use this script.

Try a set of numbers closer together (e.g., "10 7 4 1"). I'd also suggest not using high percentages: it's very unlikely you'll see results in the highest group, even with a narrowly targeted set of spam, unless you only have a very small number of nearly identical spams.

-kgd
