Matus UHLAR - fantomas wrote:
> On 09.08.09 11:33, Cedric Knight wrote:
>> I'm using Bayes and network tests, and have found a few rules with a
>> good ratio of ham to spam, but that score only 0.001 in the default rules.
>
> apparently there's no use for them alone and the score isn't 0 just because
> that would cause them not to be processed.

OK, but why are they determined to be useless? Some do have scores in
other scoresets, but not in scoreset 3 (Bayes + network). Is it a
consequence of score generation, based purely on ratio, or is it set
manually? (In the cases where the rule scores 0.001 on all scoresets, I
assume it is pegged manually.)

>
>> Here are the ones I'm talking about:
>>
>> FH_HELO_EQ_D_D_D_D
>>
>> Overlaps with HELO_DYNAMIC_IPADDR2 and TVD_RCVD_IP,
>
> this is a big problem Imho, I've even filed a bug report because of this

It does mean that when you get FPs, they are serious FPs. But if the
rules were reorganised so they didn't overlap (using meta rules or
assertions in the pattern like those below), I think there's something
to be gained from scoring each case separately.

Or, if rule A and rule B overlap such that B hits fewer spam, but has a
better ratio, then I'd say both should be active and score a positive
value: it's just that the score for B should be less if A will also hit.
(If B hit only a subset of A's spam and had a *worse* ratio, then yes, it
should be scored near 0.) Does score generation currently take this
approach?

>
>> but if you redefine it as
>>
>> header FH_HELO_EQ_D_D_D_D X-Spam-Relays-Untrusted =~ /^[^\]]+
>> helo=(?!(?:[a-z]\S*)?\d+[^\d\s]\d+[^\d\s]\d+[^\d\s]\d+[^\d\s][^\.]*\.\S+\.\S+[^\]]+
>> auth= )[^ ]{0,15}\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3}/
>>
>> it clearly hits a good number that are missed by the other rules, with a
>> similar ratio.
>
> it would match for every host sending from generic IP address (if they know the
> address and its rDNS), which is very common for dsl, cable, dialup etc
> users.

Hmmm... (a) in most cases, hosts on a dynamic address will use a machine
name rather than their rDNS; (b) is it not an assumption that these will
not be connecting directly to the MTA unless they are trusted? Cf.
RCVD_IN_SORBS_DUL, RCVD_IN_PBL, DOS_OE_TO_MX (although these depend on
last external, rather than last untrusted).

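To make the "reorganise so they don't overlap" idea concrete, here is a
minimal sketch of a meta rule for local.cf. The LOCAL_ rule name, the
description and the 0.8 score are placeholders of mine, not anything in
the default rules, and I am assuming all three referenced rules keep a
non-zero score so that they are still evaluated.

```
# Sketch only: fires when the HELO rule hits but neither overlapping rule does.
# LOCAL_HELO_D_D_D_D_ONLY is a hypothetical local name.
meta     LOCAL_HELO_D_D_D_D_ONLY  (FH_HELO_EQ_D_D_D_D && !HELO_DYNAMIC_IPADDR2 && !TVD_RCVD_IP)
describe LOCAL_HELO_D_D_D_D_ONLY  IP-like HELO not already caught by the overlapping rules
# Placeholder score; tune it against your own ham/spam corpus.
score    LOCAL_HELO_D_D_D_D_ONLY  0.8
```

Scoring the non-overlapping case on its own like this means a message
that already hits the other two rules is not penalised a third time.
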
Anyway, my point is that *empirically* this rule seems to do well,
testing against a sample over the past week, even if we exclude mail
that also hits HELO_DYNAMIC_IPADDR2 and TVD_RCVD_IP. If I search through
recent hits, it's all botnet stuff, and the ratio last week was similar
to URIBL_JP_SURBL (~0.1% ham hits). If it doesn't do as well for someone
else, maybe it's down to some interesting difference in setup, e.g.
using greylisting. A score of 0.8 or 1.0 seems to work well for me.

>> FH_HOST_EQ_VERIZON_P
>> Being based in the UK, don't have many dealings with Verizon customers,
>> so YMMV on this one. Still, only around 0.2% of hits are ham.
>
> you should understand that SA has many users living in a country with many
> verizon customers and the rules should be done so that they could be used
> generally

Viz. the US. Certainly SpamAssassin should be as widely usable as
possible, although there are problems with non-Western character sets...
OK, I withdraw my suggestion about this one as it relies purely on
factor (b) above, but I still think the others are worth a go.

CK