On Tue, 3 Nov 2020, Loren Wilton wrote:

I'm getting lots of spam messages that are 100K or longer. The spam body contains two blocks of random news text copied from Fox News or MSNBC or the like, enclosed in a zero-point font block. I'm trying to match this simple pattern to give some extra points, but I can't seem to get it to work. I'm wondering if there is some buffer limit in SA that is preventing the match from working.

There is.

If I try

  rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]*<'s

I don't get a match, even though I know there is a </font> about 50K into the message.

The closing tag is past the end of the cutoff.

But if I try

  rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]*'s

I do get a match. Note all I've done is remove the final "<" from the match text.
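The difference between those two rules can be reproduced outside SpamAssassin. The following is a rough sketch in Python, not SA itself: it assumes the scanner only ever sees a bounded window of the body, and CHUNK_LIMIT is a made-up number for illustration, not the actual SpamAssassin limit. Python's re module stands in for Perl's regex engine, which behaves the same way for these simple constructs.

```python
import re

# Hypothetical window size, chosen only for illustration; the real
# SpamAssassin limit may well be different.
CHUNK_LIMIT = 2048

OPEN_TAG = '<font style="font-size:0px">'

# A hidden block of roughly 50K characters followed by its closing tag.
message = OPEN_TAG + "x" * 50_000 + "</font>"

# Simulate a scanner that is only handed the first CHUNK_LIMIT characters.
chunk = message[:CHUNK_LIMIT]

# Same patterns as the two rules above; re.S mirrors the trailing "s" flag.
needs_close = re.compile(r'<font style="font-size:0px">[^<]*<', re.S)
no_close = re.compile(r'<font style="font-size:0px">[^<]*', re.S)

close_in_chunk = needs_close.search(chunk) is not None   # closing "<" is out of view
open_in_chunk = no_close.search(chunk) is not None       # needs no closing "<"
close_in_full = needs_close.search(message) is not None  # whole message is visible

print(close_in_chunk, open_in_chunk, close_in_full)
```

Under that assumption, the pattern that requires the trailing "<" can only hit when the closing tag lands inside the same window as the opening tag, which is consistent with the behaviour described above.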

If I try

  rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]{990,}'s

I get a match.

That's what you should do. Don't try to cut it too close, though, as all the spammer would need to do to bypass that is move the garbage block a little further back in the message. I'd suggest {900} or even {500} - 500 characters of zero-point text in a message body is not plausibly legitimate.

You don't need the "," - it doesn't matter what is there beyond your cutoff, don't waste time matching it. Basic version:

  rawbody LONG_HIDDEN m'<font style="font-size:0px">[^<]{500}'s
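To illustrate why the "," adds nothing, here is a small Python sketch (again using Python's re as a stand-in for Perl's engine; the sample strings are invented for the example). A rule only cares whether the pattern hits, and {500} and {500,} hit on exactly the same inputs; the open-ended form just consumes more text before reporting the same answer.

```python
import re

OPEN_TAG = '<font style="font-size:0px">'

# Bounded form from the rule above, and the open-ended form for comparison.
exact = re.compile(r'<font style="font-size:0px">[^<]{500}', re.S)
open_ended = re.compile(r'<font style="font-size:0px">[^<]{500,}', re.S)

# Invented sample bodies: just under, exactly at, and far over the threshold.
samples = {
    "short": OPEN_TAG + "x" * 499 + "</font>",
    "exact": OPEN_TAG + "x" * 500 + "</font>",
    "long": OPEN_TAG + "x" * 5000,  # closing tag never comes into view
}

hits_exact = {name: bool(exact.search(text)) for name, text in samples.items()}
hits_open = {name: bool(open_ended.search(text)) for name, text in samples.items()}

# How much text each form consumes on the long sample.
len_exact = len(exact.search(samples["long"]).group(0))
len_open = len(open_ended.search(samples["long"]).group(0))

print(hits_exact, hits_open, len_exact, len_open)
```

Both forms agree on every sample; the only difference is that the open-ended one keeps scanning past the 500th character.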

You may also want to stick optional whitespace in there to avoid trivial bypass:

  rawbody LONG_HIDDEN m'<font\s+style\s*=\s*"font-size:0px"\s*>[^<]{500}'s

A spammer could also add a typeface or other attributes to the <font> tag, which would bypass your simple rule, and HTML tag and attribute names are not case-sensitive. Also avoid * on complex patterns when matching arbitrarily long text, as it can lead to runaway backtracking and scan timeouts; use bounded ranges instead:

  rawbody LONG_HIDDEN m'<font\s[^>]{0,99}style\s*=\s*"font-size:0px"[^>]{0,99}>[^<]{500}'si

(Caveat: not tested, just off-the-cuff. There's room for improvement in the style spec as well.)
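One way to sanity-check the last rule before deploying it is to run the same regex against a few hand-made variants. This is only a sketch in Python (standing in for Perl's regex engine, with re.S and re.I mirroring the "si" flags); the sample tags are invented, and the last two are deliberately chosen to show gaps in the style spec that the rule as written does not cover.

```python
import re

# The final rule's pattern, copied as-is.
rule = re.compile(
    r'<font\s[^>]{0,99}style\s*=\s*"font-size:0px"[^>]{0,99}>[^<]{500}',
    re.S | re.I,
)

filler = "x" * 500  # stand-in for the hidden news text

# Invented test inputs, keyed by what they are meant to exercise.
samples = {
    "plain": '<font style="font-size:0px">' + filler,
    "uppercase": '<FONT STYLE="FONT-SIZE:0PX">' + filler,
    "extra_attrs": '<font face="Arial" style = "font-size:0px" color="#fff">' + filler,
    "visible_text": '<font style="font-size:12px">' + filler,
    "short_block": '<font style="font-size:0px">' + "x" * 100 + "</font>",
    # Known gaps in the style spec as written:
    "single_quotes": "<font style='font-size:0px'>" + filler,
    "space_after_colon": '<font style="font-size: 0px">' + filler,
}

results = {name: bool(rule.search(text)) for name, text in samples.items()}

for name, hit in results.items():
    print(f"{name:18} {'HIT' if hit else 'miss'}")
```

The two misses at the end are the kind of thing the "room for improvement" caveat refers to; whether they are worth covering depends on what the spam in the wild actually looks like.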


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org                         pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  USMC Rules of Gunfighting #7: In ten years nobody will remember
  the details of caliber, stance, or tactics. They will only remember
  who lived.
-----------------------------------------------------------------------
 Today: the Presidential Election
