On Mon, June 17, 2013 11:48 am, John Hardin wrote:
> Well, that's a much harder problem. STYLE tags have a specified format,
> and content not matching that format is (fairly) easy to detect. Comments
> are freeform text - "gibberish" has the same meaning there that it does in
> regular body text.
>
> It's *possible* that converting the __LONGWORDS rules from body to rawbody
> and making them multiline would be justified, but there would have to be
> some discussion about that. They are at present unbounded and doing that
> conversion blindly could be Very Bad.
>
> Perhaps a better approach would be to modify the HTML parser plugin to
> support rules regarding the size of HTML comments. This also could be done
> in a rawbody rule, but the size of comments may not be a useful spam sign.

All of the HTML comment garbage I've seen would explicitly match something
like:
<!-- ([A-Za-z0-9.+-]+[/\n ]){300,} -->

That is, "word" characters with some punctuation, generally
space-delimited (though I've seen some that are slash-delimited), with
lengths of 300 words or more.  Newlines are included in the delimiter
class to allow for splitting over multiple lines.  Obviously, this won't
catch them all, but it should catch most of the comment garbage, I think. 
(I can look through my FNs to see if there are any other potential
patterns.)
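
For anyone who wants to try the pattern above against saved samples, here is a minimal Python sketch. The function names and the min_words parameter are my own inventions for illustration, not anything in SpamAssassin; the regex is the one quoted above, with only the repeat count made configurable.

```python
import re


def build_comment_pattern(min_words: int = 300) -> "re.Pattern[str]":
    # The regex quoted in the message; only the repeat count is a parameter.
    return re.compile(r"<!-- ([A-Za-z0-9.+-]+[/\n ]){%d,} -->" % min_words)


def has_garbage_comment(raw_html: str, min_words: int = 300) -> bool:
    # search() rather than match(): the comment may sit anywhere in the raw body.
    return build_comment_pattern(min_words).search(raw_html) is not None
```

One thing this makes visible: as written, every word must be followed by a delimiter, and the pattern then expects a literal space before "-->".  A comment whose last word is followed by only a single space is therefore not matched, so the tail of the pattern may need loosening.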

I have received a few multi-part spams in the past few weeks, where the
message is (I guess) too long to pass through the MTA... not sure if it
gets split by my mail server or somewhere upstream.  In those cases, the
ending portion of the comment is in part 2 or 3 of the email... I don't
know if spamd runs on the entire email before it gets split, or on the
individual pieces.  If the latter, one could consider two rules: one that
matches whole comments, and one that matches only the beginning or the end
of a comment (with a correspondingly higher word threshold, since a comment
that spans a split message is necessarily huge).
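
A rough Python sketch of that two-rule idea, assuming the scanner really does see each piece separately (which I have not confirmed).  The function names, the labels it returns, and the 600-word threshold for partial comments are all hypothetical choices made for illustration.

```python
import re


def word_run(min_words: int) -> str:
    # Same word and delimiter classes as the whole-comment pattern.
    return r"(?:[A-Za-z0-9.+-]+[/\n ]){%d,}" % min_words


def classify_part(raw: str, whole_min: int = 300, partial_min: int = 600) -> str:
    # Rule 1: a complete comment inside this piece.
    if re.search(r"<!-- " + word_run(whole_min) + r" -->", raw):
        return "whole"
    # Rule 2a: a comment opens here and never closes in this piece.
    if "-->" not in raw and re.search(r"<!-- " + word_run(partial_min), raw):
        return "start"
    # Rule 2b: a comment closes here but was opened in an earlier piece.
    if "-->" in raw and "<!--" not in raw:
        if re.search(word_run(partial_min) + r" -->", raw):
            return "end"
    return "none"
```

The partial rules are deliberately stricter than the whole-comment rule, because ordinary long text without any comment markers would otherwise look like the middle of a split comment.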

For what it's worth, I think the size of the comments could well be a good
rule.  I can look through my ham but I'm pretty sure that none of it has
enormous comments like the spam does.  These comments contain multiple
kilobytes of text.  I have never seen a ham email that contains multiple
KB of commented material that includes hundreds, sometimes thousands of
words.
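
To check that claim against a ham or spam corpus, something like the following Python sketch could report the size of the largest comment in each raw message.  The function name is hypothetical, and the simple "<!-- ... -->" scan is an assumption that ignores malformed or nested markup; it is not how the SpamAssassin HTML parser works.

```python
import re

# Non-greedy, DOTALL: a comment body may span many lines.
HTML_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def largest_comment_stats(raw_html: str) -> tuple:
    """Return (bytes, words) for the largest HTML comment, or (0, 0) if none."""
    best_bytes, best_words = 0, 0
    for match in HTML_COMMENT.finditer(raw_html):
        body = match.group(1)
        size = len(body.encode("utf-8"))
        if size > best_bytes:
            # Words are counted by splitting on whitespace only.
            best_bytes, best_words = size, len(body.split())
    return best_bytes, best_words
```

Running this over both corpora and comparing the distributions would show whether a size or word-count cutoff cleanly separates the two.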

Obviously I understand the problem with FPs and the potential disaster of
creating the rule badly.  I think if you require something like 300+ words
within the comment, that would be sufficient to rule out basically every
ham.  You could also give it a relatively low score like 1.5, so that it
adds to spamminess without forcing spam=yes on messages that truly are
ham.

Thanks. =)

                                                --- Amir

