Marcin Mirosław wrote:
On 2017-03-14 16:23, Kris Deugau wrote:
If I read the information flow correctly, this is actually decided by
seek-phrases-in-log, which spits out subrules that reached a certain
hit rate in blocks, followed by the "# passed hit-rate threshold nnn"
line. mk_meta_rule_scores just takes that in, collects the rule names
in each block, and spits out the meta.
I ran some tests and watched how the output looks, to understand how some
parameters work. It seems to me that "--reqhitrate" works this way:
a) if --reqhitrate contains only one value, then the output of
seek-phrases-in-log contains only rules that hit more than the value passed
to --reqhitrate. So this cuts off rules that are hit rarely.
b) if --reqhitrate contains more than one value, then:
<high cut off level equal to the highest value passed to --reqhitrate> rules
<second value> other rules <low cut off>
example:
--reqhitrate "70 10 1" gives:
<100%> - no rules here - <70%> - rules that match less than 70% of
spam - <10%> rules that match less than 10% of spam and more than 1% -
<1%> - cut off, no rules here
You got me curious about exactly what this means, and we're both right,
just describing it differently.
For this usage, seek-phrases-in-log does roughly the following (for
multiple values to --reqhitrate):
  for each pattern
    determine the percentage of spam it hits
    discard the pattern if it's less than the lowest threshold
  sort patterns by hit percentage, highest first
  for each pattern
    if the hit rate has passed the next threshold
      print flag line "# passed hit-rate threshold <threshold>"
      advance to next threshold
    print a line "# 1.0 <percentage> 0"
    print the subrule
So if you call it with --reqhitrate "50 0.1 5 20 1" on the right set of
spam, you might get an intermediate file containing:
body __FOO1 /foo1/
# passed hit-rate threshold: 50
body __FOO2 /foo2/
body __FOO3 /foo3/
# passed hit-rate threshold: 20
body __FOO4 /foo4/
body __FOO5 /foo5/
body __FOO6 /foo6/
# passed hit-rate threshold: 5
body __FOO7 /foo7/
# passed hit-rate threshold: 1
body __FOO8 /foo8/
(Plus the additional "# 1.0 <percentage> 0" comment lines that are just
noise at this point.)
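For anyone who wants to play with the thresholds without running the real script, here is a rough Python sketch of the loop described above. It is not the code from seek-phrases-in-log: the function name, the (name, regex, percentage) tuple layout, and the strict "less than" comparison are my own assumptions, and I use a loop rather than a single "if" so that several thresholds can be passed by one pattern.

```python
def emit_subrules(patterns, thresholds):
    """Sketch only. patterns: list of (name, regex, hit_pct) tuples (hypothetical layout).
    thresholds: percentages in any order, as given to --reqhitrate."""
    levels = sorted(thresholds, reverse=True)   # highest threshold first
    cutoff = levels[-1]                         # lowest value only acts as a cutoff

    # Discard patterns below the lowest threshold, then sort by hit rate.
    kept = [p for p in patterns if p[2] >= cutoff]
    kept.sort(key=lambda p: p[2], reverse=True)

    lines = []
    idx = 0
    for name, regex, pct in kept:
        # Assumption: "passed" means the hit rate dropped strictly below the threshold.
        while idx < len(levels) - 1 and pct < levels[idx]:
            lines.append(f"# passed hit-rate threshold: {levels[idx]:g}")
            idx += 1
        lines.append(f"# 1.0 {pct:g} 0")
        lines.append(f"body {name} /{regex}/")
    return lines


if __name__ == "__main__":
    # Made-up hit rates, chosen only to illustrate the grouping.
    demo = [("__FOO1", "foo1", 60), ("__FOO2", "foo2", 40), ("__FOO3", "foo3", 3)]
    print("\n".join(emit_subrules(demo, [50, 0.1, 5, 20, 1])))
```

With suitably chosen (invented) hit rates this produces the same layout as the intermediate file shown above, including the "# 1.0 <percentage> 0" lines.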
mk_meta_rule_scores just takes those groups of rules, separated by the "#
passed hit-rate..." lines, and builds FOO_1, FOO_2, FOO_3 etc meta rules
- 5 of them in this case. mk_meta_rule_scores itself is a pretty simpleminded
script; most of the heavy lifting is done in seek-phrases-in-log.
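The grouping step is simple enough to sketch in a few lines of Python. This is only an illustration, not the real script: the function names and the "FOO" prefix are made up, and I am assuming the subrule names in a group are simply joined with "||" inside the meta.

```python
def group_subrules(lines):
    """Split intermediate-file lines into groups of subrule names.
    A new group starts at every "# passed hit-rate threshold" line."""
    groups = [[]]
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("# passed hit-rate threshold"):
            groups.append([])
        elif stripped and not stripped.startswith("#"):
            parts = stripped.split()
            if len(parts) >= 2:
                groups[-1].append(parts[1])   # second field is the rule name
    return groups


def build_metas(groups, prefix="FOO"):
    """Assumption: each meta ORs together the subrules of its group."""
    return [f"meta {prefix}_{i} ({' || '.join(names)})"
            for i, names in enumerate(groups, 1)]


if __name__ == "__main__":
    sample = [
        "body __FOO1 /foo1/",
        "# passed hit-rate threshold: 50",
        "body __FOO2 /foo2/",
        "body __FOO3 /foo3/",
    ]
    print("\n".join(build_metas(group_subrules(sample))))
```

Four flag lines give five groups, which is where the count of five metas comes from.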
However, if the message data you feed in doesn't separate out to produce
at least one rule in each range, the combination of seek-phrases-in-log and
mk_meta_rule_scores will happily create an empty meta rule:
# passed hit-rate threshold: 50
body __FOO1 /foo1/
body __FOO2 /foo2/
body __FOO3 /foo3/
# passed hit-rate threshold: 20
body __FOO4 /foo4/
body __FOO5 /foo5/
body __FOO6 /foo6/
# passed hit-rate threshold: 5
# passed hit-rate threshold: 1
body __FOO7 /foo7/
body __FOO8 /foo8/
With the above output from seek-phrases-in-log, mk_meta_rule_scores will create
"meta FOO_1 ()" and "meta FOO_4 ()", since there are no patterns in the
first or fourth groups (100% to 50% and 5% to 1%, by your description).
It also scores these empty metas at 0 for tidiness - after all, they'll
never fire.
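If you want to check an intermediate file for empty groups before generating metas, something like the following Python sketch would do. The function name is hypothetical, and the "score FOO_n 0" lines are only my guess at how the zero score would be written; I have not checked the exact output of the script.

```python
def empty_group_numbers(lines):
    """Return the 1-based numbers of groups that contain no subrule lines.
    Groups are separated by "# passed hit-rate threshold" lines."""
    counts = [0]
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("# passed hit-rate threshold"):
            counts.append(0)
        elif stripped and not stripped.startswith("#"):
            counts[-1] += 1
    return [n for n, c in enumerate(counts, 1) if c == 0]


sample = """\
# passed hit-rate threshold: 50
body __FOO1 /foo1/
body __FOO2 /foo2/
body __FOO3 /foo3/
# passed hit-rate threshold: 20
body __FOO4 /foo4/
body __FOO5 /foo5/
body __FOO6 /foo6/
# passed hit-rate threshold: 5
# passed hit-rate threshold: 1
body __FOO7 /foo7/
body __FOO8 /foo8/
"""

for n in empty_group_numbers(sample.splitlines()):
    print(f"meta FOO_{n} ()")
    print(f"score FOO_{n} 0")   # assumed form of the zero score
```

For the sample above this reports groups 1 and 4 as empty.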
In your case, to look at your original question, none of the derived
patterns matched more than 70% of the spam set you fed in, so the _1
rule was empty.
Have you tried mk_meta_rule_scores, and did you get more than two score
values? I get only the default and the value in the medium range. I suspect
that mk_meta_rule_scores doesn't play well with ranges. It is something
I can live with, but if there is a bug somewhere I would try to report it.
Even if it is not fixed, a report can save some time for other users trying
to use this script.
Try a set of numbers closer together (eg, "10 7 4 1"), and I'd suggest
not using high percentages as it's very unlikely you'll see results in
the highest group even with a narrowly targeted set of spam, or if you
only have a very small number of nearly identical spams.
-kgd