> -----Original Message-----
> From: Mark [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 01, 2003 5:24 PM
> To: Gary Funck; Spamassassin List
> Subject: Re: [SAtalk] those pesky small v*agra ads
>
>
> ----- Original Message -----
> From: "Gary Funck" <[EMAIL PROTECTED]>
> To: "Spamassassin List" <[EMAIL PROTECTED]>
> Sent: Friday, August 01, 2003 7:18 PM
> Subject: RE: [SAtalk] those pesky small v*agra ads
>
>
> > > -----Original Message-----
> > > From: Mark
> > > Sent: Friday, August 01, 2003 8:34 AM
> > [...]
> > >
> > > That is the good news. :) The bad news is, that the true
> > > background color, or I should say, background appearance,
> > > is almost impossible to determine. Consider table colors,
> > > <td> colors, etc. Not to mention that white, stretched gif
> > > used for background color. And that is just 'old' style HTML.
> > > :)
> > >
> > > Hence I gave my rules a low score. But still, you might find
> > > them useful.
> >
> > Clearly, Spamassassin should invoke a browser to render the html,
> > convert that to a graphic file format, and then run an OCR
> > algorithm on the result. Then SA can run its body filters on
> > the extracted text.
> >
> > What's the problem? <g>
>
> Actually, I was thinking the exact same thing; the <g> included. :)
> Because that really would be the only way. In the original poster's case,
>
> > <body><font color="#ffffff">satchel <font color="#ffffff">brains <font
> > color="#ffffff">alexander <font color="#ffffff">evacuation <font
> > color="#ffffff">metier <font color="#ffffff">extant <font
> > color="#ffffff">crept <font color="#ffffff">bonaparte <font
> > color="#ffffff">ar <font color="#ffffff">testifiers <font
>
> This particular one could easily be caught with this rule:
>
> full FONT_PUSH_BEFORE_POP /\<[^>]*?font[^>]*?\>[^<]*?\<[^>]*?font[^>]*?\>/mi
> describe FONT_PUSH_BEFORE_POP HTML: <font> immediately repeated before </font>
> score FONT_PUSH_BEFORE_POP 2.0
>
> And that is certainly worth a 2.0 score, as this smacks of a spammer trick
> trying to avoid SA checking for legitimately closed HTML markup code -- in
> this case: <font></font>.
>
> But I think -- barring obvious spammer stupidity, like the above text --
> combatting invisible text will be very hard.
>
> It may not be totally hopeless, though. HTML being interpreted sequentially,
> there is a relatively easy way to determine background color:
>
> 1) DO: Read the bgcolor property of the <body> tag, and set that as our
>    bgcolor token (and assume #ffffff when absent).
> 2) DO: Read forward, and reset the bgcolor token when you encounter the next
>    bgcolor property.
> 3) MATCH: Appearing text has the color of the first 'look-behind' bgcolor
>    property.
>
> That is the basic outline. Now, step 2 requires fine-tuning, as it needs to
> also take into account the "class" attribute inside <td>, <div>, etc. In
> that case, step 2 has an extra pass:
>
> 2) DO: Read forward, and reset the bgcolor ...
>    2a) DO: if "class" attribute, decode color attribute,
>        and set our bgcolor token accordingly.
>
> But essentially the principle remains the same: the first 'look-behind'
> (bg)?color property (or the fallback body bgcolor) will be what we can match
> the color of our text to.
>
> Thinking out loud here for a moment, our parser should literally PUSH the
> sequential bgcolor codes as it encounters them, so that the last bgcolor
> code on the stack will be our match color, and then POP the stack for each
> 'closed' bgcolor. That would get trickier for imported style-sheets and all;
> but imported style-sheets (with http://, not via a relative directory, as
> you would see in regular HTML), could themselves be marked "suspect".
>
> Well, it is all a bit easier said than done, but not hopeless per se.
>
> - Mark
>
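For anyone who wants to experiment with the PUSH/POP idea quoted above, here is a minimal sketch in Python using the standard library's html.parser. It is not SpamAssassin code and not Mark's implementation. The class and function names are made up for illustration, the default text color of #000000 is my assumption (only the #ffffff background default comes from the outline), and it looks only at bgcolor, <body text=...> and <font color=...> attributes. It ignores "class", style sheets, background images, and named colors such as "white", so it covers steps 1 to 3 but not step 2a.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they are never pushed (illustrative subset).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class InvisibleTextFinder(HTMLParser):
    """Collect text whose color equals the nearest 'look-behind' bgcolor."""

    def __init__(self, default_bg="#ffffff", default_fg="#000000"):
        super().__init__()
        # Each stack entry: (tag, background in effect, text color in effect).
        # default_fg is an assumption; the outline only specifies the background.
        self.stack = [("#root", default_bg, default_fg)]
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        _, bg, fg = self.stack[-1]
        # PUSH: inherit the current colors unless this tag overrides them.
        bg = (attrs.get("bgcolor") or bg).lower()
        if tag == "body":
            fg = (attrs.get("text") or fg).lower()
        elif tag == "font":
            fg = (attrs.get("color") or fg).lower()
        self.stack.append((tag, bg, fg))

    def handle_endtag(self, tag):
        # POP back to the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        _, bg, fg = self.stack[-1]
        # MATCH: text drawn in the same color as its background is invisible.
        if data.strip() and fg == bg:
            self.hidden.append(data.strip())


def find_hidden_text(html):
    """Return the list of text runs that would render invisibly."""
    finder = InvisibleTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden
```

Fed something like the original poster's sample, find_hidden_text() returns the white-on-white words, while the same white font inside a <td> that inherits a black body background is left alone. Unclosed <font> tags simply stay on the stack, which is harmless here.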
Wow I take one day off and fall really behind! :)

1st, Mark, can I post your rules to the emporium?

2nd, I like the push without pop rule. Counting the number of "<font" entries would be a slow rule, but possible nonetheless.

3rd, I think the idea of rendering the HTML and OCRing it is backwards. I think a _separate_ process should be invoked: a program that removes all HTML, runs rules that are specific to just the un-HTML version, and gives it a sum of points. Then it passes those points on to the regular SA process. Almost like running SA twice.

Just thoughts.

Chris Santerre
System Admin and SA Custom Rules Emporium keeper
http://www.merchantsoverseas.com/wwwroot/gorilla/sa_rules.htm
"A little nonsense now and then, is relished by the wisest men." - Willy Wonka

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
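To make the "separate process" idea a bit more concrete, here is a rough Python sketch of such a pre-pass: strip the HTML, run text-only rules on what is left, and hand back a point total. Everything in it is hypothetical. The rule names, patterns, and scores are invented for illustration, and how the total would be passed on to the regular SA process is left open.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep only the character data; tags and comments are dropped."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Join without a separator so words split by tags are glued back together,
    # then collapse runs of whitespace.
    return " ".join("".join(extractor.parts).split())


# Hypothetical text-only rules: (name, pattern, points).
TEXT_RULES = [
    ("PLAIN_VIAGRA", re.compile(r"v[i1*]agra", re.I), 1.5),
    ("PLAIN_BUY_NOW", re.compile(r"\bbuy now\b", re.I), 0.5),
]


def text_only_score(html):
    """Return (total points, list of (rule name, points)) for the un-HTML text."""
    text = strip_html(html)
    hits = [(name, pts) for name, rx, pts in TEXT_RULES if rx.search(text)]
    return sum(pts for _, pts in hits), hits
```

Joining the pieces with no separator is a deliberate choice: it undoes the trick of breaking a word up with tags, at the cost of gluing together words that were only separated by block tags such as <p>.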