On Thu, 30 Jan 2014, Amir Caspi wrote:

On Jan 30, 2014, at 10:28 AM, Kevin A. McGrail <kmcgr...@pccc.com> wrote:

If you want to share the complete rule, I can throw it into my sandbox and see 
what masscheck thinks as well.

The complete rule would be something like this, assuming Andy implemented it as 
I wrote it:

rawbody HTML_NONSENSE_TAGS      /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
describe HTML_NONSENSE_TAGS     Many consecutive multi-letter HTML tags, likely nonsense/spam
score HTML_NONSENSE_TAGS        0.001

Score to be adjusted as needed, of course.

I'd suggest writing it as a subrule first, to see how well it performs against the masscheck corpora. If it does well by itself (good hits, high S/O), then a meta can be added to expose it for scoring. If it hits a lot but the S/O ratio is low, then it could be analyzed for possible combinations with other rules to get something that performs well.
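
As a rough, untested sketch of what I mean: the leading "__" is what makes a rule an unscored subrule, and the meta lines would only be added once the masscheck results look good. The __SOME_OTHER_SUBRULE and HTML_NONSENSE_COMBO names are hypothetical placeholders, not existing rules.

```
# Step 1: subrule only, so masscheck can report hits and S/O
rawbody  __HTML_NONSENSE_TAGS   /(?:<[A-Za-z0-9]{4,}>\s*){10,}/

# Step 2 (later, only if it performs well by itself): expose it for scoring
meta     HTML_NONSENSE_TAGS     __HTML_NONSENSE_TAGS
describe HTML_NONSENSE_TAGS     Many consecutive multi-letter HTML tags, likely nonsense/spam
score    HTML_NONSENSE_TAGS     0.001

# Alternative to step 2 if it hits a lot but S/O is low: combine it
# with another rule. The names below are hypothetical placeholders.
# meta   HTML_NONSENSE_COMBO    __HTML_NONSENSE_TAGS && __SOME_OTHER_SUBRULE
```

Keeping the pattern in a subrule means the scored name can be redefined or dropped later without touching the regex itself.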

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  For those who are being swayed by Microsoft's whining about the
  GPL, consider how aggressively viral their Shared Source license is:
  If you've *ever* seen *any* MS code covered by the Shared Source
  license, you're infected for life. MS can sue you for Intellectual
  Property misappropriation whenever they like, so you'd better not
  come up with any Innovative Ideas that they want to Embrace...
-----------------------------------------------------------------------
 2 days until the 11th anniversary of the loss of STS-107 Columbia
