On Feb 14, 2014, at 11:00 AM, Adam Katz <antis...@khopis.com> wrote:
>> Given the nature of the content, I'd go the other direction and not
>> require the word boundary. This removes the wildcard, though it
>> doesn't short-circuit as quickly, so one could debate which version
>> is more efficient.
>>
>>   body     __HEXHASHWORD  /\b[a-z]{1,10}\s[0-9a-f]{30}/
>>   tflags   __HEXHASHWORD  multiple maxhits=5
>>   meta     HEXHASH_WORD   __HEXHASHWORD > 4
>>   describe HEXHASH_WORD   Five hexadecimal hashes, each following a word
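As an aside, for anyone unfamiliar with tflags multiple: here's roughly
what the multiple/maxhits pair above is doing, as a standalone sketch.
This is only an illustration of the counting behavior, not
SpamAssassin's actual internals; it assumes the rendered body arrives on
stdin:

  #!/usr/bin/perl
  # Rough emulation of "tflags multiple maxhits=5": count non-overlapping
  # matches of the sub-rule pattern, capped at 5, then apply the meta test.
  use strict;
  use warnings;

  my $body = do { local $/; <STDIN> };   # slurp the rendered body

  my $hits = 0;
  while ($body =~ /\b[a-z]{1,10}\s[0-9a-f]{30}/g) {
      last if ++$hits >= 5;              # maxhits=5 stops counting here
  }
  print "HEXHASH_WORD fires\n" if $hits > 4;   # the meta: __HEXHASHWORD > 4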
On 02/14/2014 10:12 AM, Amir Caspi wrote:
> The main issue I have with the code above, or any tflags=multiple
> code, is that it doesn't require the _same_ hex string, just _any_ 5
> hex strings within an email. Granted, the emails where that appears
> are likely to be spam, but they may not necessarily be. I think
> forcing the repetition check is important, although the only good way
> to do that is with backreferences (as I sent a day or two ago) and
> that is likely a CPU hog.

Yes, there is an increased FP risk from matching five different hex
strings (e.g. a list of checksums). That's probably where the current
Rule QA FPs <http://ruleqa.spamassassin.org/?rule=/HEXHASH> come from.
Still, it gets a decent .968 S/O (relative precision
<https://en.wikipedia.org/wiki/Precision_and_recall>) with a very small
number of FPs (0.0104%). Based on that, it is likely safe to assign a
point or so.

If you want to assign a high score (3+), you'd be absolutely correct
on needing the full match (though watch for truncation; some of the
strings in your sample <http://pastebin.com/zCStErch> had an extra
character on the end). This version of the rule is more expensive, but
is safer to score higher (maybe 3-4 points):

  body     HEXHASH_WORD_5  /\b[a-z]{1,10}\s([0-9a-f]{30})(?:.{0,99}\b[a-z]{1,10}\s\1){4}/
  describe HEXHASH_WORD_5  Five copies of the same hexadecimal hash, each following a word

I know you don't have Bayes enabled, but Bayes is the best source of
negative points. If you had Bayes turned on (and it weren't enough to
catch this spam by itself), its negative points could keep an FP from
exceeding your spam threshold, which would let you assign this rule
slightly more points. (Be careful with that premise; it doesn't scale.
Bayes provides a limited number of negative points and doesn't fire on
all ham.)

> Another problem with the above code is that you require only a short
> word (1-10 chars) prior to the hex string. Some perfectly legitimate,
> or even illegitimate, words could be longer than 10 chars. I'd
> increase the upper limit to something like 15ish, but, per above, I
> think the potential for FPs is reasonably high here.

Your sample did not contain any 7+ character words preceding the long
hex string, so broadening that range beyond the three-character buffer
we've already afforded it merely increases your FP risk. (Note that
there were twelve copies of that string in the sample while the rule
only requires five; I figure at least five of them will be preceded by
1-10 character words followed by 30-char hex strings.) File names can
be longer and could therefore become FPs.
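If you want to sanity-check the back-reference version against your
pastebin sample before deploying it, a quick one-off like the following
should do. It assumes the decoded message body is saved in sample.txt
(a hypothetical file name); since SpamAssassin matches body rules
against the rendered text, matching a raw file like this is only an
approximation:

  #!/usr/bin/perl
  # One-off check of the HEXHASH_WORD_5 pattern against a saved sample.
  use strict;
  use warnings;

  open my $fh, '<', 'sample.txt' or die "sample.txt: $!";
  my $body = do { local $/; <$fh> };
  $body =~ s/\s+/ /g;    # rough stand-in for SA's whitespace normalization

  if ($body =~ /\b[a-z]{1,10}\s([0-9a-f]{30})(?:.{0,99}\b[a-z]{1,10}\s\1){4}/) {
      print "HEXHASH_WORD_5 would hit (hash: $1)\n";
  } else {
      print "no hit\n";
  }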