On 31 Jan 2003, Jason Kohles wrote:
> On Fri, 2003-01-31 at 12:23, Bob Apthorpe wrote:
> > On 31 Jan 2003 12:04:17 -0500
> > Jason Kohles <[EMAIL PROTECTED]> wrote:
> >
> > > There are also many webservers that provide the ability to define your
> > > own tags (Roxen's RXML, and IIS front-page extensions for example).
> >
> > True, but do those show up in email? Should they? (rhetorical questions
> > answered only by looking through a different mail corpus than mine.)
> >
> I have a lot of this stuff in my non-spam corpus mainly from webserver
> mailing lists and web project discussions for projects that use these
> features.

A small amount of whitelisting should allow valid list traffic if SA
started flagging non-standard tags. And that would be great if everyone
had the knowledge, willingness, and control to create custom rules.

Is it worth investigating modules like HTML::Clean or HTML::Tagset to
detect HTML mail crapped up[1] with non-standard tagging or excessive
commenting? Compare the size of:

- raw HTML content
- content w/o comments
- content w/o comments & non-standard tags
- content w/o any tagging

Provided the overhead isn't huge, you should get nice numerical metrics
for comment fraction, non-standard tag fraction, and content/HTML ratio.
Throw in invisible text fraction for good measure.

Worse comes to worse, one could extend these modules to recognize common
proprietary tagging to let the Microsoft dross through unscathed. I don't
know if that's really necessary though.

Does SA really need a full-blown HTML analyzer built in? I suspect that
once you strip invisible text and all HTML tagging, the resulting content
will be unambiguously spam or ham, or completely empty.
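For what it's worth, the size comparison described above can be roughed out in a few lines. This is only a sketch, and in Python's standard html.parser rather than the Perl modules named above, purely to keep it self-contained. KNOWN_TAGS is a made-up, deliberately tiny stand-in for the element list HTML::Tagset provides, and strip_markup and html_metrics are hypothetical names, not anything in SA. Invisible text is not handled, and because end tags are re-emitted in normalized form the sizes are approximate for sloppy markup.

```python
from html.parser import HTMLParser

# Assumption: a tiny illustrative whitelist, standing in for a real
# element list such as the one HTML::Tagset provides.
KNOWN_TAGS = {
    "html", "head", "title", "body", "p", "br", "hr", "a", "b", "i",
    "font", "center", "div", "span", "img", "table", "tr", "td",
    "ul", "ol", "li", "h1", "h2", "h3",
}


class _Stripper(HTMLParser):
    """Re-emit HTML, dropping comments and any tag keep_tag() rejects."""

    def __init__(self, keep_tag):
        super().__init__(convert_charrefs=False)
        self.keep_tag = keep_tag
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.keep_tag(tag):
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if self.keep_tag(tag):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.keep_tag(tag):
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    # handle_comment is left at its default (a no-op), so comments vanish.


def strip_markup(html, keep_tag):
    """Return html without comments and without tags keep_tag() rejects."""
    parser = _Stripper(keep_tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def html_metrics(html):
    """Compute the size-based fractions suggested in the message above."""
    raw = len(html)
    if raw == 0:
        return {"comment_fraction": 0.0,
                "nonstandard_tag_fraction": 0.0,
                "content_ratio": 0.0}
    no_comments = len(strip_markup(html, lambda tag: True))
    standard_only = len(strip_markup(html, lambda tag: tag in KNOWN_TAGS))
    text_only = len(strip_markup(html, lambda tag: False))
    return {
        "comment_fraction": (raw - no_comments) / raw,
        "nonstandard_tag_fraction": (no_comments - standard_only) / raw,
        "content_ratio": text_only / raw,
    }
```

Calling html_metrics() on a message body gives three numbers between 0 and 1 that a rule could threshold on; where to put those thresholds would have to come from corpus testing, not from this sketch.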
See:

http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/lib/HTML/Tree/Scanning.pod
http://search.cpan.org/dist/HTML-Clean/
http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/
http://search.cpan.org/author/SBURKE/HTML-Tagset-3.03/Tagset.pm

-- Bob

[1] Insert tiresome snarky comment about HTML in email being crap enough
here.

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk