On Mon, June 17, 2013 11:48 am, John Hardin wrote:
> Well, that's a much harder problem. STYLE tags have a specified format,
> and content not matching that format is (fairly) easy to detect. Comments
> are freeform text - "gibberish" has the same meaning there that it does in
> regular body text.
>
> It's *possible* that converting the __LONGWORDS rules from body to rawbody
> and making them multiline would be justified, but there would have to be
> some discussion about that. They are at present unbounded and doing that
> conversion blindly could be Very Bad.
>
> Perhaps a better approach would be to modify the HTML parser plugin to
> support rules regarding the size of HTML comments. This also could be done
> in a rawbody rule, but the size of comments may not be a useful spam sign.

All of the HTML comment garbage I've seen would explicitly match something
like:

    <!-- ([A-Za-z0-9.+-]+[/\n ]){300,} -->

That is, "word" characters with some punctuation, generally space-delimited
(though I've seen some that are slash-delimited), with lengths of 300 words
or more. Newlines are included in the delimiter class to allow for comments
split over multiple lines. Obviously, this won't catch them all, but I think
it should catch most of the comment garbage. (I can look through my FNs to
see if there are any other potential patterns.)

I have received a few multi-part spams in the past few weeks, where the
message is (I guess) too long to pass through the MTA... I'm not sure
whether it gets split by my mail server or somewhere upstream. In those
cases, the ending portion of the comment is in part 2 or 3 of the email. I
don't know if spamd runs on the entire email before it gets split, or on the
individual pieces. If the latter, one could consider two rules: one that
matches whole comments, and one that matches only the beginning or the end
of a comment. The second rule would need a correspondingly higher word
count, since a comment that spans a split message is necessarily huge.

For what it's worth, I think the size of the comments could well be a good
rule. I can look through my ham, but I'm pretty sure that none of it has
enormous comments like the spam does. These comments contain multiple
kilobytes of text. I have never seen a ham email that contains multiple KB
of commented material with hundreds, sometimes thousands, of words.

Obviously I understand the problem with FPs and the potential disaster of
creating the rule badly. I think requiring something like 300+ words within
the comment would be sufficient to rule out basically every ham. You could
also give it a relatively low score like 1.5, so that it adds to spamminess
without forcing spam=yes on messages that truly are ham.

Thanks. =)

--- Amir
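
For anyone who wants to try the comment pattern from this message against a corpus before turning it into a rule, here is a minimal Python sketch. It is not a SpamAssassin rule and says nothing about how rawbody matching chunks a message. The function names `has_gibberish_comment` and `max_comment_words` are made up for illustration. One deliberate change from the pattern as written: the literal space before `-->` is dropped, because the delimiter after the last word already consumes it, and keeping both would require two delimiters at the end of the comment.

```python
import re


def has_gibberish_comment(raw_html: str, min_words: int = 300) -> bool:
    """Return True if raw_html holds a comment of min_words+ delimited words.

    Adapted from the pattern in the post. Assumption: the space before
    "-->" is omitted, since the last word's delimiter already matches it.
    """
    pattern = re.compile(
        r"<!-- (?:[A-Za-z0-9.+-]+[/\n ]){%d,}-->" % min_words
    )
    return pattern.search(raw_html) is not None


def max_comment_words(raw_html: str) -> int:
    """Return the word count of the largest HTML comment, or 0 if none.

    Words are runs of characters separated by whitespace or slashes,
    which is an assumption mirroring the delimiters named in the post.
    """
    comments = re.findall(r"<!--(.*?)-->", raw_html, flags=re.DOTALL)
    return max((len(re.findall(r"[^/\s]+", c)) for c in comments), default=0)


if __name__ == "__main__":
    # Synthetic examples only; these are not taken from real mail.
    spam = "<p>Buy now</p><!-- " + " ".join(["word"] * 400) + " --><p>x</p>"
    ham = "<!-- header start --><p>Hello</p>"
    print(has_gibberish_comment(spam), max_comment_words(spam))
    print(has_gibberish_comment(ham), max_comment_words(ham))
```

The second function is a sketch of the size-based idea: it measures comments without caring what characters they contain, so it could be run over a ham corpus to check whether large comments ever show up there. Messages with CRLF line endings would need `\r` added to the delimiter class of the first pattern.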