On Feb 14, 2014, at 11:00 AM, Adam Katz <antis...@khopis.com> wrote:

> Given the nature of the content, I'd go the other direction and not require
> the word boundary. This removes the wildcard, though it doesn't short
> circuit as quickly, so one could debate which version is more efficient.
>
> body     __HEXHASHWORD /\b[a-z]{1,10}\s[0-9a-f]{30}/
> tflags   __HEXHASHWORD multiple maxhits=5
> meta     HEXHASH_WORD  __HEXHASHWORD > 4
> describe HEXHASH_WORD  Five hexadecimal hashes, each following a word
>
> I'm curious if the hex string is always so similar; it may be enough to use
> \bb8b177bf24975 and not need the tflags multiple piece.

The hex string is not always that similar; I've had similar spams with completely different strings. The same string is repeated multiple times per email, but it's different in each email. I would not hardcode the hex string at all.

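To make the counting behaviour of the quoted rule concrete, here is a rough Python emulation. It is only a sketch: SpamAssassin evaluates body rules on its own rendered text, and the function names, sample words and sample hashes below are made up for illustration. The point is that the pattern keys on the shape of the hash (30 lowercase hex characters after a short word), not on any particular value, so emails carrying different hashes are treated alike.

```python
import re

# Same regex as the quoted __HEXHASHWORD body rule.
HEXHASHWORD = re.compile(r"\b[a-z]{1,10}\s[0-9a-f]{30}")

# Rough stand-in for "tflags multiple maxhits=5": stop counting at 5 hits.
MAXHITS = 5


def hexhashword_hits(body: str) -> int:
    """Count non-overlapping matches, capped at MAXHITS."""
    hits = 0
    for _ in HEXHASHWORD.finditer(body):
        hits += 1
        if hits >= MAXHITS:
            break
    return hits


def hexhash_word(body: str) -> bool:
    """Emulates 'meta HEXHASH_WORD __HEXHASHWORD > 4'."""
    return hexhashword_hits(body) > 4


def fake_email(hex_hash: str, copies: int) -> str:
    """Hypothetical spam body: the same hash after a word, repeated."""
    return " ".join(f"offer {hex_hash}" for _ in range(copies))


if __name__ == "__main__":
    # Two made-up 30-character hashes standing in for two different emails.
    hash_a = "b8b177bf24975" + "0" * 17
    hash_b = "1c9e" * 7 + "aa"
    print(hexhash_word(fake_email(hash_a, 5)))
    print(hexhash_word(fake_email(hash_b, 5)))
    print(hexhash_word(fake_email(hash_a, 4)))
```

Because nothing in the pattern depends on the hash value, there is no string to keep up to date as the spammers rotate it.
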
The main issue I have with the code above, or any "tflags multiple" code, is that it doesn't require the _same_ hex string, just _any_ 5 hex strings within an email. Granted, emails where that happens are likely to be spam, but not necessarily. I think forcing the repetition check is important, although the only good way to do that is with backreferences (as I sent a day or two ago), and that is likely a CPU hog.

Another problem with the above code is that it allows only a short word (1-10 chars) before the hex string. Some perfectly legitimate, or even illegitimate, words could be longer than 10 chars. I'd increase the upper limit to something like 15, but, per the above, I think the potential for FPs is reasonably high here.

Cheers.

--- Amir
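
For illustration, the difference between "any 5 hex strings" and "the same hex string 5 times" can be sketched in Python. This is not the regex from my earlier message, just a minimal stand-in written for this example; the function names, the default word limit of 15 and the sample bodies are all assumptions. The backreference \1 forces every later hash to equal the first one captured, which is also what makes the pattern expensive on long bodies that don't match.

```python
import re


def repeated_hash_pattern(max_word_len: int = 15, repeats: int = 5) -> re.Pattern:
    """Build a regex needing the SAME 30-char hex string `repeats` times.

    Each occurrence must follow a lowercase word of 1..max_word_len chars.
    """
    word = r"\b[a-z]{1,%d}\s" % max_word_len
    first = word + r"([0-9a-f]{30})"
    # Lazy gap, then a word followed by the hash captured in group 1.
    again = r"(?:.*?" + word + r"\1){" + str(repeats - 1) + "}"
    # DOTALL lets the gap between repetitions span line breaks.
    return re.compile(first + again, re.DOTALL)


def same_hash_repeated(body: str, max_word_len: int = 15) -> bool:
    """True only if one hash appears after a word at least 5 times."""
    return repeated_hash_pattern(max_word_len).search(body) is not None


if __name__ == "__main__":
    one_hash = "b8b177bf24975" + "0" * 17          # made-up 30-char hash
    many = [format(i, "030x") for i in range(1, 6)]  # five distinct hashes

    spam_like = " ".join("offer " + one_hash for _ in range(5))
    mixed = " ".join("offer " + h for h in many)
    long_word = " ".join("notification " + one_hash for _ in range(5))

    print(same_hash_repeated(spam_like))
    print(same_hash_repeated(mixed))
    print(same_hash_repeated(long_word, max_word_len=10))
    print(same_hash_repeated(long_word, max_word_len=15))
```

The "mixed" body contains five hex strings but no repetition, so a plain count would flag it while the backreference version would not. The "long_word" body shows the word-length limit: a 12-letter word before the hash is missed at 10 and caught at 15.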