----- Original Message -----
From: "Gary Funck" <[EMAIL PROTECTED]>
To: "Spamassassin List" <[EMAIL PROTECTED]>
Sent: Friday, August 01, 2003 7:18 PM
Subject: RE: [SAtalk] those pesky small v*agra ads

> > -----Original Message-----
> > From: Mark
> > Sent: Friday, August 01, 2003 8:34 AM
> [...]
> >
> > That is the good news. :) The bad news is, that the true
> > background color, or I should say, background appearance,
> > is almost impossible to determine. Consider table colors,
> > <td> colors, etc. Not to mention that white, stretched gif
> > used for background color. And that is just 'old' style HTML.
> > :)
> >
> > Hence I gave my rules a low score. But still, you might find
> > them useful.
>
> Clearly, Spamassassin should invoke a browser to render the html,
> convert that to a graphic file format, and then run an OCR
> algorithm on the result. Then SA can run it's body filters on
> the extracted text.
>
> What's the problem? <g>

Actually, I was thinking the exact same thing; the <g> included. :)
Because that really would be the only way. In the original poster's case,

> <body><font color="#ffffff">satchel <font color="#ffffff">brains <font
> color="#ffffff">alexander <font color="#ffffff">evacuation <font
> color="#ffffff">metier <font color="#ffffff">extant <font
> color="#ffffff">crept <font color="#ffffff">bonaparte <font
> color="#ffffff">ar <font color="#ffffff">testifiers <font

This particular one could easily be caught with this rule:

full     FONT_PUSH_BEFORE_POP /<font\b[^>]*>[^<]*<font\b[^>]*>/i
describe FONT_PUSH_BEFORE_POP HTML: <font> immediately repeated before </font>
score    FONT_PUSH_BEFORE_POP 2.0

The pattern requires "font" to follow the "<" directly, so that a
closing </font> is not counted as the second opening tag.

And that is certainly worth a 2.0 score, as this smacks of a spammer
trick to evade SA checks for properly closed HTML markup -- in this
case: <font></font>.

But I think -- barring obvious spammer stupidity, like the above text --
combatting invisible text will be very hard. It may not be totally
hopeless, though.

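For anyone who wants to try the FONT_PUSH_BEFORE_POP idea outside of SA first, here is a minimal Python sketch. It is only an illustration, not SA code: the function name is made up, and the pattern is written so that "font" must follow the "<" directly, which keeps a closing </font> from being counted as a second opening tag.

```python
import re

# Two opening <font ...> tags with only text (no other tag) between them.
# Assumption: "font" must directly follow "<", so "</font>" never matches.
FONT_PUSH_BEFORE_POP = re.compile(
    r"<font\b[^>]*>[^<]*<font\b[^>]*>",
    re.IGNORECASE,
)


def font_pushed_before_popped(html: str) -> bool:
    """Return True if a <font> is opened again before the previous one closes."""
    return FONT_PUSH_BEFORE_POP.search(html) is not None


if __name__ == "__main__":
    spam = '<body><font color="#ffffff">satchel <font color="#ffffff">brains'
    ham = '<font color="red">hello</font> <font color="blue">world</font>'
    print(font_pushed_before_popped(spam))
    print(font_pushed_before_popped(ham))
```

Note that legitimately nested markup such as <font size="2"><font color="red"> would also trigger this, which is one more reason to keep the score moderate.
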
HTML being interpreted sequentially, there is a relatively easy way to
determine background color:

1) DO: Read the bgcolor property of the <body> tag, and set that as our
   bgcolor token (and assume #ffffff when absent).
2) DO: Read forward, and reset the bgcolor token when you encounter the
   next bgcolor property.
3) MATCH: Compare the color of any text that appears against the first
   'look-behind' bgcolor property; if the two are equal, the text is
   invisible.

That is the basic outline. Now, step 2 requires fine-tuning, as it also
needs to take into account the "class" attribute inside <td>, <div>,
etc. In that case, step 2 has an extra pass:

2)  DO: Read forward, and reset the bgcolor ...
2a) DO: If there is a "class" attribute, decode its color attribute, and
    set our bgcolor token accordingly.

But essentially the principle remains the same: the first 'look-behind'
(bg)?color property (or the fallback body bgcolor) will be what we can
match the color of our text to.

Thinking out loud here for a moment, our parser should literally PUSH
the sequential bgcolor codes as it encounters them, so that the last
bgcolor code on the stack will be our match color, and then POP the
stack for each 'closed' bgcolor. That would get trickier for imported
style-sheets and all; but imported style-sheets (with http://, not via a
relative directory, as you would see in regular HTML) could themselves
be marked "suspect".

Well, it is all a bit easier said than done, but not hopeless per se.

- Mark
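
The PUSH/POP outline above can be sketched roughly as follows, using Python's standard html.parser. This is only a sketch under stated assumptions, not how SA does it: the class and function names are made up, the default text color is assumed to be #000000, colors are compared as plain lower-cased strings (so "white" and "#ffffff" are not recognised as equal), and step 2a (the "class" attribute and style-sheets) is left out entirely.

```python
from html.parser import HTMLParser

DEFAULT_BG = "#ffffff"   # step 1: assumed when <body> has no bgcolor
DEFAULT_FG = "#000000"   # assumption: default text color
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed


class InvisibleTextFinder(HTMLParser):
    """Hypothetical helper: collect text whose color equals the background."""

    def __init__(self):
        super().__init__()
        self.bg_stack = [DEFAULT_BG]   # last entry is the current bgcolor
        self.fg_stack = [DEFAULT_FG]   # last entry is the current text color
        self.open_tags = []            # (tag, pushed_bg, pushed_fg)
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        bg, fg = attrs.get("bgcolor"), attrs.get("color")
        if bg:                                   # step 2: PUSH new bgcolor
            self.bg_stack.append(bg.strip().lower())
        if fg:
            self.fg_stack.append(fg.strip().lower())
        self.open_tags.append((tag, bool(bg), bool(fg)))

    def handle_endtag(self, tag):
        # Find the most recent open tag with this name; ignore stray closers.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                break
        else:
            return
        # POP it, together with anything left unclosed inside it.
        for _, pushed_bg, pushed_fg in self.open_tags[i:]:
            if pushed_bg:
                self.bg_stack.pop()
            if pushed_fg:
                self.fg_stack.pop()
        del self.open_tags[i:]

    def handle_data(self, data):
        # Step 3: MATCH text color against the bgcolor on top of the stack.
        if data.strip() and self.fg_stack[-1] == self.bg_stack[-1]:
            self.hidden.append(data.strip())


def find_invisible_text(html):
    parser = InvisibleTextFinder()
    parser.feed(html)
    parser.close()   # flush any trailing text
    return parser.hidden
```

Feeding it the original poster's sample, find_invisible_text('<body><font color="#ffffff">satchel <font color="#ffffff">brains') returns ['satchel', 'brains'], since no bgcolor is set and the #ffffff fallback applies.
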