On Feb 14, 2014, at 11:00 AM, Adam Katz <antis...@khopis.com> wrote:

> Given the nature of the content, I'd go the other direction and not require 
> the word boundary.  This removes the wildcard, though it doesn't short 
> circuit as quickly, so one could debate which version is more efficient.
> body      __HEXHASHWORD   /\b[a-z]{1,10}\s[0-9a-f]{30}/
> tflags    __HEXHASHWORD   multiple maxhits=5
> meta      HEXHASH_WORD    __HEXHASHWORD > 4
> describe  HEXHASH_WORD    Five hexadecimal hashes, each following a word
> I'm curious if the hex string is always so similar; it may be enough to use  
> \bb8b177bf24975  and not need the tflags multiple piece.
The hex string is not always that similar; I've had similar spams with 
completely different strings.  The same string is repeated multiple times per 
email, but it's different in each email.  I would not hardcode the hex string 
at all.

The main issue I have with the code above, or any tflags multiple code, is that 
it doesn't require the _same_ hex string, just _any_ 5 hex strings within an 
email.  Granted, emails where that happens are likely to be spam, but not 
necessarily.  I think forcing the repetition check is important, although the 
only good way to do that is with backreferences (as I sent a day or two ago), 
and that is likely a CPU hog.
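
To make the "same" versus "any" distinction concrete, here is a rough Python 
sketch, run outside SpamAssassin and purely illustrative.  The function names 
are made up for this example; the pattern is the quoted __HEXHASHWORD regex 
with a capture group added around the hex part.

```python
import re
from collections import Counter

# Same shape as the quoted __HEXHASHWORD pattern, with the hex part captured.
WORD_HEX = re.compile(r"\b[a-z]{1,10}\s([0-9a-f]{30})")

def any_hex_hits(text: str) -> int:
    """Count every word+hex match, whatever the hex string is.

    This mirrors what counting hits with tflags multiple measures.
    """
    return len(WORD_HEX.findall(text))

def max_same_hex_hits(text: str) -> int:
    """Count occurrences of the single most frequent hex string."""
    counts = Counter(WORD_HEX.findall(text))
    return max(counts.values(), default=0)

# One hex string repeated five times, as in the spam described above.
same = " ".join("click " + "ab12" * 7 + "cd" for _ in range(5))
# Five different 30-char hex strings, each after a short word.
mixed = " ".join("id " + str(d) * 30 for d in range(5))
```

With these inputs, any_hex_hits() returns 5 for both messages, while 
max_same_hex_hits() returns 5 for the repeated string and 1 for the five 
different ones.  Only the second count expresses the repetition check.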

Another problem with the above code is that it only matches a short word (1-10 
chars) before the hex string.  Words longer than 10 chars, whether in 
legitimate mail or in spam, would not match.  I'd increase the upper limit to 
something like 15, but, as noted above, I think the potential for FPs is 
reasonably high here.
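
A quick way to see the effect of the length limit, again as an illustrative 
Python sketch rather than a SpamAssassin rule.  The helper word_hex_pattern() 
and the sample text are assumptions made up for this example.

```python
import re

def word_hex_pattern(max_word_len: int) -> re.Pattern:
    """Build the quoted pattern with a configurable upper word length."""
    return re.compile(r"\b[a-z]{1,%d}\s[0-9a-f]{30}" % max_word_len)

# A 12-letter word followed by a 30-char hex string.
sample = "unsubscribed " + "ab12" * 7 + "cd"

matches_with_10 = word_hex_pattern(10).search(sample) is not None
matches_with_15 = word_hex_pattern(15).search(sample) is not None
```

Here matches_with_10 is False and matches_with_15 is True: because of the 
leading \b, the 1-10 version cannot start partway through a longer word, so 
the 12-letter word is skipped entirely.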

Cheers.

--- Amir
