Michael Scheidell mused:
>> would adding 1 point for each 1K of header length help?

J.D. Falk responded:
> Interesting idea!  I don't know the precise semantics of the
> contents of that header, but this certainly sounds possible.

Seconded.

I don't think this is efficient at all (I'm leaning on "no way in
hell"), but here's a REALLY UGLY implementation:

#    header BIG_HEADER       ALL =~ /(?=.{1024})/m
#    tflags BIG_HEADER       multiple  # DO NOT USE THIS
#    score  BIG_HEADER       0.001
#    meta   1KB_HEADER       BIG_HEADER
#    score  1KB_HEADER       0.999

Side note: "(?=pattern)" is a zero-width positive look-ahead assertion
that theoretically limits the memory dedicated to this operation.  An
even less efficient but otherwise equivalent version could more simply
just say /.{1024}/m
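
To make the look-ahead point concrete, here is a small Python sketch
(Python's re module treats these constructs like Perl does for this
purpose).  It is an illustration only: the helper name count_hits is
made up, the window is shrunk from 1024 to 4 so the numbers stay
readable, and it assumes a 'multiple' rule counts hits the way a
global (/g-style) scan would, which I have not verified against the
SpamAssassin source.

```python
import re

def count_hits(pattern, text):
    """Count matches the way a global (/g-style) scan would."""
    return sum(1 for _ in re.finditer(pattern, text, re.M))

line = "abcdefgh"  # 8 characters, standing in for a long header line

# Zero-width look-ahead: matches at every position that still has
# 4 characters ahead of it, i.e. positions 0 through 4.
lookahead_hits = count_hits(r"(?=.{4})", line)

# Consuming version: each match eats 4 characters, so the scan
# only finds non-overlapping chunks.
consuming_hits = count_hits(r".{4}", line)

print(lookahead_hits, consuming_hits)
```

Both forms agree on whether a line is long enough to hit at all; under
the assumption above they would differ only in how many hits get
counted.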

While it "works" (it adds 1.000 for the first 1K and 0.001 for each
subsequent character, i.e. 1.024 for each subsequent 1K of header
length) it *should not be used*; CPU-efficiency aside, you'll see a
BIG_HEADER for EVERY character beyond 1023 in each header, which can
make for a pretty darn long summary in your headers and in the body's
content analysis details.  Also, it doesn't add additional 1KB_HEADER
hits for additional big headers.
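
The arithmetic above can be checked with a rough Python sketch.  It
assumes a single unfolded header line and one BIG_HEADER hit per
qualifying character position; the function name and the use of
thousandths of a point (to avoid floating-point noise) are my own
choices for illustration.

```python
import re

BIG_HEADER_MILLIS = 1    # score 0.001, in thousandths of a point
KB_HEADER_MILLIS = 999   # score 0.999, in thousandths of a point

def ugly_score_millis(header_line):
    """Score (in thousandths) for one header line under the ugly rules."""
    # One BIG_HEADER hit for every position with 1024 characters ahead.
    hits = sum(1 for _ in re.finditer(r"(?=.{1024})", header_line, re.M))
    if hits == 0:
        return 0
    # The meta rule 1KB_HEADER fires once, however many hits there are.
    return hits * BIG_HEADER_MILLIS + KB_HEADER_MILLIS

for length in (1023, 1024, 2048, 3072):
    print(length, ugly_score_millis("a" * length) / 1000)
```

A 1024-character line scores 1.000, and each further 1024 characters
adds 1.024, so a 2048-character line scores 2.024.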

Instead, I've added the following to my sandbox:

# Idea from Michael Scheidell of SECNAP on SA Users List 2010-05-08
# http://old.nabble.com/yahoo-X-YMail-OSG-tp28496110p28496110.html
header   SINGLE_HEADER_1K  ALL:raw =~ /^(?!X-Spam|X-MailScan)(?=.{1024,2047}$)/m
header   SINGLE_HEADER_2K  ALL:raw =~ /^(?=.{2048,3071}$)/m
header   SINGLE_HEADER_3K  ALL:raw =~ /^(?=.{3072,4095}$)/m
header   SINGLE_HEADER_4K  ALL:raw =~ /^(?=.{4096,5119}$)/m
header   SINGLE_HEADER_5K  ALL:raw =~ /^(?=.{5120})/m
describe SINGLE_HEADER_1K  A single header contains 1K-2K characters
describe SINGLE_HEADER_2K  A single header contains 2K-3K characters
describe SINGLE_HEADER_3K  A single header contains 3K-4K characters
describe SINGLE_HEADER_4K  A single header contains 4K-5K characters
describe SINGLE_HEADER_5K  A single header contains 5K+ characters

header   BIG_HEADERS_2K    ALL:raw =~ /^(?=.{2048,3071}$)/s
header   BIG_HEADERS_3K    ALL:raw =~ /^(?=.{3072,4095}$)/s
header   BIG_HEADERS_4K    ALL:raw =~ /^(?=.{4096,5119}$)/s
header   BIG_HEADERS_5K    ALL:raw =~ /^(?=.{5120})/s
tflags   BIG_HEADERS_2K    userconf  # unset trusted_networks can mess this up
describe BIG_HEADERS_2K    Headers contain 2K-3K characters total
describe BIG_HEADERS_3K    Headers contain 3K-4K characters total
describe BIG_HEADERS_4K    Headers contain 4K-5K characters total
describe BIG_HEADERS_5K    Headers contain 5K+ characters total
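
For anyone who wants to poke at the patterns outside SpamAssassin,
here is a rough Python sketch that runs the same regexes over a raw
header string.  The function name rules_hit and the sample headers are
invented for the example, and it assumes Python's re.M and re.S behave
like Perl's /m and /s for these patterns; it says nothing about how
SpamAssassin itself builds ALL:raw.

```python
import re

# Per-line rules, evaluated with /m (re.M): "." stops at newlines.
SINGLE = {
    "SINGLE_HEADER_1K": r"^(?!X-Spam|X-MailScan)(?=.{1024,2047}$)",
    "SINGLE_HEADER_2K": r"^(?=.{2048,3071}$)",
    "SINGLE_HEADER_3K": r"^(?=.{3072,4095}$)",
    "SINGLE_HEADER_4K": r"^(?=.{4096,5119}$)",
    "SINGLE_HEADER_5K": r"^(?=.{5120})",
}

# Whole-block rules, evaluated with /s (re.S): "." spans newlines.
BIG = {
    "BIG_HEADERS_2K": r"^(?=.{2048,3071}$)",
    "BIG_HEADERS_3K": r"^(?=.{3072,4095}$)",
    "BIG_HEADERS_4K": r"^(?=.{4096,5119}$)",
    "BIG_HEADERS_5K": r"^(?=.{5120})",
}

def rules_hit(raw_headers):
    """Return the names of the rules that match the raw header text."""
    hits = [n for n, p in SINGLE.items() if re.search(p, raw_headers, re.M)]
    hits += [n for n, p in BIG.items() if re.search(p, raw_headers, re.S)]
    return hits

# One oversized header line, small total size.
one_long = "Subject: hi\n" + "X-Junk: " + "a" * 1500 + "\n"
# Same size, but excluded by the X-Spam negative look-ahead.
own_report = "X-Spam-Report: " + "a" * 1500 + "\n"
# Thirty ordinary lines that add up to about 3K in total.
many_small = ("X-Pad: " + "a" * 93 + "\n") * 30

print(rules_hit(one_long), rules_hit(own_report), rules_hit(many_small))
```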


I could have made each SINGLE_HEADER_xK rule 'tflags multiple' so that
it penalizes for multiple egregiously large headers, but I think the
BIG_HEADERS_xK rules better facilitate that.  I also suspect that the
SINGLE_HEADER_xK rules won't be as useful (which isn't to say either
will be useful ... we just have to wait for a few reasonably sized
masscheck corpora to see, which might take a while these days!).
