Michael Scheidell mused:
>> would adding 1 point for each 1K of header length help?

J.D. Falk responded:
> Interesting idea! I don't know the precise semantics of the
> contents of that header, but this certainly sounds possible.

Seconded. I don't think this is efficient at all (I'm leaning on "no way in hell"), but here's a REALLY UGLY implementation:

# header  BIG_HEADER  ALL =~ /(?=.{1024})/m
# tflags  BIG_HEADER  multiple     # DO NOT USE THIS
# score   BIG_HEADER  0.001
# meta    1KB_HEADER  BIG_HEADER
# score   1KB_HEADER  0.999

Side note: "(?=pattern)" is a zero-width positive look-ahead assertion that theoretically limits the memory dedicated to this operation. An even less efficient but otherwise equivalent version could more simply just say /.{1024}/m

While it "works" (it adds 1.000 for the first 1K and 0.001 for each subsequent character, i.e. 1.024 for each subsequent 1K of header length), it *should not be used*. CPU-efficiency aside, you'll see a BIG_HEADER hit for EVERY character beyond 1023 in each header, which can make for a pretty darn long summary in your headers and in the body's content analysis details. Also, since 1KB_HEADER is a meta rule, it hits at most once per message: it doesn't add additional 1KB_HEADER hits for additional big headers.

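To make that arithmetic concrete, here's a rough Python sketch. It is not SpamAssassin; it only emulates the same regex with Python's `re` module. The function names are made up for illustration, and I'm assuming that 'tflags multiple' counts one hit per zero-width match, the way `re.finditer` does, which is the behavior described above.

```python
import re

# Same pattern as the BIG_HEADER rule. With re.M and without re.S,
# "." never matches a newline, so each header line is measured alone.
BIG_HEADER = re.compile(r"(?=.{1024})", re.M)

def big_header_hits(headers):
    """Count positions that are followed by at least 1024 characters
    on the same line (one hit per position, as with 'tflags multiple')."""
    return sum(1 for _ in BIG_HEADER.finditer(headers))

def big_header_score(headers):
    """0.001 per BIG_HEADER hit, plus 0.999 once for the 1KB_HEADER meta."""
    hits = big_header_hits(headers)
    return round(0.001 * hits + (0.999 if hits else 0.0), 3)

one_k = "X-Test: " + "a" * 1016 + "\n"   # exactly 1024 characters before the newline
two_k = "X-Test: " + "a" * 2040 + "\n"   # exactly 2048 characters before the newline

print(big_header_hits(one_k), big_header_score(one_k))
print(big_header_hits(two_k), big_header_score(two_k))
```

A line of N >= 1024 characters yields N - 1023 hits, which is where the flood of BIG_HEADER entries in the report comes from.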
Instead, I've added the following to my sandbox:

# Idea from Michael Scheidell of SECNAP on SA Users List 2010-05-08
# http://old.nabble.com/yahoo-X-YMail-OSG-tp28496110p28496110.html
header   SINGLE_HEADER_1K  ALL:raw =~ /^(?!X-Spam|X-MailScan)(?=.{1024,2047}$)/m
header   SINGLE_HEADER_2K  ALL:raw =~ /^(?=.{2048,3071}$)/m
header   SINGLE_HEADER_3K  ALL:raw =~ /^(?=.{3072,4095}$)/m
header   SINGLE_HEADER_4K  ALL:raw =~ /^(?=.{4096,5119}$)/m
header   SINGLE_HEADER_5K  ALL:raw =~ /^(?=.{5120})/m
describe SINGLE_HEADER_1K  A single header contains 1K-2K characters
describe SINGLE_HEADER_2K  A single header contains 2K-3K characters
describe SINGLE_HEADER_3K  A single header contains 3K-4K characters
describe SINGLE_HEADER_4K  A single header contains 4K-5K characters
describe SINGLE_HEADER_5K  A single header contains 5K+ characters

header   BIG_HEADERS_2K    ALL:raw =~ /^(?=.{2048,3071}$)/s
header   BIG_HEADERS_3K    ALL:raw =~ /^(?=.{3072,4095}$)/s
header   BIG_HEADERS_4K    ALL:raw =~ /^(?=.{4096,5119}$)/s
header   BIG_HEADERS_5K    ALL:raw =~ /^(?=.{5120})/s
tflags   BIG_HEADERS_2K    userconf  # unset trusted_networks can mess this up
describe BIG_HEADERS_2K    Headers contain 2K-3K characters total
describe BIG_HEADERS_3K    Headers contain 3K-4K characters total
describe BIG_HEADERS_4K    Headers contain 4K-5K characters total
describe BIG_HEADERS_5K    Headers contain 5K+ characters total

I could have made each SINGLE_HEADER_xK rule 'tflags multiple' so that it penalizes for multiple egregiously large headers, but I think the BIG_HEADERS_xK rules better facilitate that. I also suspect that the SINGLE_HEADER_xK rules won't be as useful (which isn't to say either will be useful ... we just have to wait for a few reasonably sized masscheck corpora to see, which might take a while these days!).
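For anyone who wants to poke at the size buckets outside of SpamAssassin, here's a rough Python sketch that runs the same patterns over a header block. The `RULES` table and `fired_rules` helper are hypothetical names, the sample headers are invented, and the input string is only my stand-in for whatever SpamAssassin actually hands a rule as ALL:raw (I haven't checked how folded headers show up there).

```python
import re

# Patterns copied from the sandbox rules. re.M stands in for Perl's /m
# (per-line anchors), re.S for Perl's /s ("." also matches newlines, so
# the whole header block is measured at once).
RULES = {
    "SINGLE_HEADER_1K": (r"^(?!X-Spam|X-MailScan)(?=.{1024,2047}$)", re.M),
    "SINGLE_HEADER_2K": (r"^(?=.{2048,3071}$)", re.M),
    "SINGLE_HEADER_3K": (r"^(?=.{3072,4095}$)", re.M),
    "SINGLE_HEADER_4K": (r"^(?=.{4096,5119}$)", re.M),
    "SINGLE_HEADER_5K": (r"^(?=.{5120})", re.M),
    "BIG_HEADERS_2K": (r"^(?=.{2048,3071}$)", re.S),
    "BIG_HEADERS_3K": (r"^(?=.{3072,4095}$)", re.S),
    "BIG_HEADERS_4K": (r"^(?=.{4096,5119}$)", re.S),
    "BIG_HEADERS_5K": (r"^(?=.{5120})", re.S),
}

def fired_rules(all_raw):
    """Return the names of the rules whose pattern matches the header block."""
    return [name for name, (pattern, flags) in RULES.items()
            if re.search(pattern, all_raw, flags)]

# One oversized header line (1513 characters) in an otherwise small block.
one_long = ("Subject: hi\n"
            "X-YMail-OSG: " + "A" * 1500 + "\n"
            "From: a@example.com\n")

# Same size, but in a header that SINGLE_HEADER_1K deliberately skips.
spam_report = ("Subject: hi\n"
               "X-Spam-Report: " + "A" * 1500 + "\n")

# Three 900-character lines: no single line is big, but the total is 2703.
many_medium = ("X-A: " + "b" * 895 + "\n") * 3

print(fired_rules(one_long))
print(fired_rules(spam_report))
print(fired_rules(many_medium))
```

The third sample is the case the BIG_HEADERS_xK rules are meant to catch: lots of header text spread over lines that individually stay under 1K.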