I recently noticed a couple of cases where SA (3.1.4 or earlier)
would take over a minute (instead of a few seconds) to check a 500 kB
message. Investigation revealed that these cases have one thing in
common: they were all message/partial chunks of a longish transfer of
some document or other data. Moreover, most of them were hitting
random sets of SARE or baseline rules, yielding false positives.

In case someone suggests that Content-Type: message/partial should
be banned outright - well, that is a policy decision, and if it is
allowed, it should not bring SA to its knees on a 0.5 MB message.

Here is one example where a command-line 'spamassassin -t -D' would
run for 68 seconds. Timestamping each debug line produces the
following top-10 lines, sorted by elapsed time; the first column is
the time in seconds it took for that line to appear after the
previous one:

1.935 dbg: rules: ran body rule SARE_RMML_Stock1 ======> got hit: "0TC"
2.204 dbg: rules: ran body rule __SARE_SPEC_LRD_COST4 ======> got hit: "134"
3.695 dbg: rules: ran body rule SARE_RMML_Stock9 ======> got hit: "0il"
3.976 dbg: rules: ran body rule __NONEMPTY_BODY ======> got hit: "i"
4.021 dbg: rules: running raw-body-text per-line regexp tests; score ... 
6.397 dbg: rules: ran body rule FB_NOT_SEX ======> got hit: " Sjx"
8.225 dbg: bayes: tok_get_all: token count: 37175
8.254 dbg: rules: ran body rule __SARE_SPEC_LRD_COST5 ======> got hit: "169"
9.682 dbg: rules: ran body rule __SARE_SPEC_LRD_COST6 ======> got hit: "218"
11.999 dbg: rules: running body-text per-line regexp tests; score so far=2.501

and another example:

2.396 dbg: rules: ran body rule DISGUISE_PORN_MUNDANE ======> got hit: "b0y"
2.424 dbg: rules: ran body rule __SARE_SPEC_LRD_COST4 ======> got hit: "134"
2.627 dbg: bayes: tok_get_all: token count: 36631
3.421 dbg: rules: running body-text per-line regexp tests; score so far=0.203
3.826 dbg: rules: ran body rule SARE_RMML_Stock9 ======> got hit: "0Il"
4.181 dbg: rules: running raw-body-text per-line regexp tests; score ... 
4.265 dbg: rules: ran body rule FB_NOT_SEX ======> got hit: " S8X"
8.113 dbg: rules: ran body rule FUZZY_XPILL ======> got hit: "XoNOgX"
9.308 dbg: rules: ran body rule __SARE_SPEC_LRD_COST5 ======> got hit: "169"
9.945 dbg: rules: ran body rule __SARE_SPEC_LRD_COST6 ======> got hit: "218"
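(For reference, the per-line deltas in the listings above can be
computed with a small filter over timestamped debug output. This is
just an illustrative sketch - the timestamps and lines below are
invented, not taken from a real run:)

```python
# Sketch: given (timestamp, text) pairs from a timestamping wrapper
# around 'spamassassin -t -D', compute per-line deltas and show the
# slowest lines first. Sample data is invented for illustration.

def deltas(stamped_lines):
    """Return (delta_seconds, text) pairs, delta relative to the previous line."""
    out = []
    prev = None
    for t, text in stamped_lines:
        out.append((0.0 if prev is None else round(t - prev, 3), text))
        prev = t
    return out

sample = [
    (0.000, "dbg: rules: running body-text per-line regexp tests"),
    (1.935, "dbg: rules: ran body rule SARE_RMML_Stock1"),
    (8.332, "dbg: bayes: tok_get_all: token count: 37175"),
]

# print the slowest lines first
for d, text in sorted(deltas(sample), reverse=True)[:2]:
    print(f"{d:.3f} {text}")
```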

I know some of these are SARE rules, but others are baseline rules
or Bayes token parsing.

Here is a relevant section/sample of one of these messages:

MIME-Version: 1.0
Content-Type: message/partial;
        total=22;
        id="[EMAIL PROTECTED]";
        number=21
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.2869
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2869

f6idzxqa608aID8+YhwNSQwBpIrboHA0/zPfOP26mB6eONz70Xl12DwGVnAPemaaKaJyQk5ZKUwg
VC0sGYHLd543cICNa1piu8YgRJR0EaEK7GNVXvFSriat5dZwj7PNzQuOTO030bra7tBjROxbrVYR
XFStjnugVkyH27zqrvUdUsHYnLaVLdUuAxWH51QDV9/kc6vtIURcdUbthPszq12lj7Lt7rMAtVX7


So the problem is that these base64-encoded lines in a message/partial
chunk are treated as obfuscated text, which makes scanning very slow
and produces almost random hits on various rules. It also places some
burden on the SQL server (bayes: tok_get_all: token count: 37175).
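(Such a block is cheap to recognize before any body rules run. This
is a hypothetical heuristic, not SpamAssassin code: a run of long
lines drawn almost entirely from the base64 alphabet, with no spaces,
is very unlikely to be natural-language text:)

```python
import re

# Hypothetical heuristic (not SpamAssassin code): several consecutive
# lines of 60-78 chars from the base64 alphabet, with optional '='
# padding, strongly suggest an encoded blob rather than prose.
B64_LINE = re.compile(r'^[A-Za-z0-9+/]{60,78}={0,2}$')

def looks_like_base64_block(lines, min_run=3):
    """True if at least `min_run` consecutive lines match the base64 shape."""
    run = 0
    for line in lines:
        run = run + 1 if B64_LINE.match(line) else 0
        if run >= min_run:
            return True
    return False

# the sample lines from the message/partial chunk quoted above
body = [
    "f6idzxqa608aID8+YhwNSQwBpIrboHA0/zPfOP26mB6eONz70Xl12DwGVnAPemaaKaJyQk5ZKUwg",
    "VC0sGYHLd543cICNa1piu8YgRJR0EaEK7GNVXvFSriat5dZwj7PNzQuOTO030bra7tBjROxbrVYR",
    "XFStjnugVkyH27zqrvUdUsHYnLaVLdUuAxWH51QDV9/kc6vtIURcdUbthPszq12lj7Lt7rMAtVX7",
]
print(looks_like_base64_block(body))  # → True
```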


A somewhat similar case, which also hits various obfuscation rules
because its UU-encoded body is mistaken for plain text, is mail with
attachments produced by Microsoft Office Outlook when the user has
the following setting chosen:

  Tools -> Options -> Mail Format -> Internet format: plain text options:
    (YES) Encode attachments in UUENCODE format
          when sending a plain text message
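(UU-encoded attachments are likewise easy to spot. A hypothetical
detection sketch - not Outlook or SpamAssassin code, and the file
name is invented - keying on the "begin <mode> <name>" header and the
fixed shape of full data lines:)

```python
import binascii
import re

# Hypothetical sketch: a uuencoded attachment starts with a
# "begin <mode> <name>" line; full 45-byte data lines are 61 chars,
# beginning with the length byte 'M' followed by chars in ' '..'`'.
BEGIN = re.compile(r'^begin [0-7]{3,4} \S')
DATA = re.compile(r'^M[ -`]{60}$')

def looks_like_uuencode(lines):
    """True if a 'begin' line is followed by full-width uuencoded data lines."""
    for i, line in enumerate(lines):
        if BEGIN.match(line):
            rest = lines[i + 1:i + 4]
            if rest and all(DATA.match(l) for l in rest):
                return True
    return False

# build a synthetic uuencoded body for demonstration
payload = bytes(range(45 * 3))
data_lines = [
    binascii.b2a_uu(payload[i:i + 45]).decode().rstrip("\n")
    for i in range(0, len(payload), 45)
]
msg = ["begin 644 report.doc"] + data_lines + ["`", "end"]
print(looks_like_uuencode(msg))  # → True
```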

It would be nice if such encodings were recognized, so that rules
which expect plain text could at least be prevented from running
and/or producing false hits.

  Mark
