On Feb 14, 2014, at 11:00 AM, Adam Katz <antis...@khopis.com> wrote:
>>
>> Given the nature of the content, I'd go the other direction and not
>> require the word boundary.  This removes the wildcard, though it
>> doesn't short circuit as quickly, so one could debate which version
>> is more efficient.
>> body      __HEXHASHWORD   /\b[a-z]{1,10}\s[0-9a-f]{30}/
>> tflags    __HEXHASHWORD   multiple maxhits=5
>> meta      HEXHASH_WORD    __HEXHASHWORD > 4
>> describe  HEXHASH_WORD    Five hexadecimal hashes, each following a word
>>

On 02/14/2014 10:12 AM, Amir Caspi wrote:
>
> The main issue I have with the code above, or any tflags=multiple
> code, is that it doesn't require the _same_ hex string, just _any_ 5
> hex strings within an email.  Granted, the emails where that appears
> are likely to be spam, but they may not necessarily be.  I think
> forcing the repetition check is important, although the only good way
> to do that is with backreferences (as I sent a day or two ago) and
> that is likely a CPU hog.
>

Yes, there is an increased FP risk due to the ability to match different
hex strings (e.g. a list of checksums).  That's probably where the
current Rule QA FPs <http://ruleqa.spamassassin.org/?rule=/HEXHASH> come
from.  Still, it gets a decent 0.968 S/O (relative precision
<https://en.wikipedia.org/wiki/Precision_and_recall>) with a very small
FP rate (0.0104%).  Based on that, the rule is likely safe to score a
point or so.
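For concreteness, S/O is just the fraction of a rule's hits that land on
spam.  A minimal sketch with made-up corpus counts (not the actual Rule
QA numbers) that would produce the quoted figure:

```python
# Hypothetical hit counts chosen to illustrate the ratio; the real
# Rule QA corpus counts will differ.
spam_hits = 968   # rule hits on known spam
ham_hits = 32     # rule hits on known ham (FPs)

# S/O: spam hits over all hits -- a precision-like measure.
so = spam_hits / (spam_hits + ham_hits)
print(round(so, 3))  # 0.968
```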

If you want to assign a high score (3+), you'd be absolutely correct
about needing the full-string match (though watch for truncation; some of
the strings in your sample <http://pastebin.com/zCStErch> had an extra
character on the end).

This version of the rule is more expensive, but is safer to score higher
(maybe 3-4 points):

body      HEXHASH_WORD_5  /\b[a-z]{1,10}\s([0-9a-f]{30})(?:.{0,99}\b[a-z]{1,10}\s\1){4}/
describe  HEXHASH_WORD_5  Five copies of the same hexadecimal hash, each following a word
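A quick way to sanity-check the backreference behavior outside
SpamAssassin is to run the same pattern through a regex engine directly.
This sketch uses Python's re module with made-up hash strings (not the
ones from your sample):

```python
import re

# The body regex from HEXHASH_WORD_5, unchanged.
PATTERN = re.compile(
    r'\b[a-z]{1,10}\s([0-9a-f]{30})(?:.{0,99}\b[a-z]{1,10}\s\1){4}')

h = '0123456789abcdef0123456789abcd'   # one made-up 30-char hex string

# Five copies of the SAME hash, each preceded by a short word: should hit.
spam_like = ' '.join(['foo ' + h] * 5)

# Five DIFFERENT hashes (e.g. a checksum list): the \1 backreference
# should refuse to match, unlike the tflags=multiple version.
ham_like = ' '.join('foo %030x' % n for n in range(5))

print(bool(PATTERN.search(spam_like)))  # True
print(bool(PATTERN.search(ham_like)))   # False
```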


I know you don't have Bayes enabled, but Bayes is the best source of
negative points.  If you had Bayes turned on (and it weren't enough to
catch this spam by itself), you could rely on its negative points to keep
an FP from exceeding your spam threshold, and could therefore assign this
rule slightly more points.  (Be careful with that premise; it doesn't
scale.  Bayes provides a limited number of negative points and doesn't
fire on all ham.)

> Another problem with the above code is that you require only a short
> word (1-10 chars) prior to the hex string.  Some perfectly legitimate,
> or even illegitimate, words could be longer than 10 chars.  I'd
> increase the upper limit to something like 15ish, but, per above, I
> think the potential for FPs is reasonably high here.
>

Your sample did not contain any words of 7+ characters preceding the
long hex string, so broadening that range beyond the three-character
buffer we've already afforded it merely increases your FP risk.  (There
were twelve copies of that string in the sample while the rule only
requires five, so I figure at least five of them will be 1-10 character
words followed by 30-char hex strings.)  File names can be longer and
could therefore become FPs.
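To see why widening the {1,10} quantifier matters, compare the current
range against a {1,15} variant on a filename-like token.  This is a
hedged sketch; "longfilename" and the hash are invented test data:

```python
import re

h = 'deadbeefdeadbeefdeadbeefdeadbe'   # made-up 30-char hex string
text = 'longfilename ' + h             # a 12-letter filename-ish word

narrow = re.compile(r'\b[a-z]{1,10}\s[0-9a-f]{30}')   # current rule
wide   = re.compile(r'\b[a-z]{1,15}\s[0-9a-f]{30}')   # proposed widening

print(bool(narrow.search(text)))  # False: 12 letters exceed {1,10}
print(bool(wide.search(text)))    # True: the wider range now hits
```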
