On 31 Jan 2003, Jason Kohles wrote:

> On Fri, 2003-01-31 at 12:23, Bob Apthorpe wrote:
> > On 31 Jan 2003 12:04:17 -0500
> > Jason Kohles <[EMAIL PROTECTED]> wrote:
> >
> > > There are also many webservers that provide the ability to define your
> > > own tags (Roxen's RXML, and IIS front-page extensions for example).
> >
> > True, but do those show up in email? Should they? (rhetorical questions
> > answered only by looking through a different mail corpus than mine.)
> >
> I have a lot of this stuff in my non-spam corpus mainly from webserver
> mailing lists and web project discussions for projects that use these
> features.

A small amount of whitelisting should allow valid list traffic if SA
started flagging non-standard tags. And that would be great if everyone
had the knowledge, willingness, and control to create custom rules.

Is it worth investigating modules like HTML::Clean or HTML::Tagset to
detect HTML mail crapped up[1] with non-standard tagging or excessive
commenting? Compare the size of:

- raw HTML content
- content w/o comments
- content w/o comments & non-standard tags
- content w/o any tagging

Provided the overhead isn't huge, you should get nice numerical metrics
for comment fraction, non-standard tag fraction, and content/HTML ratio.
Throw in invisible text fraction for good measure.
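
To make the size comparison concrete, here is a rough sketch of the three
ratios. It is not the HTML::Clean or HTML::Tagset approach; it uses Python's
standard html.parser as a stand-in, and the names STANDARD_TAGS, SizeCounter,
and html_metrics are hypothetical. The tag list is a small illustrative
subset, not a real catalogue of standard HTML, and invisible-text detection
is left out because it needs style and color heuristics.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative subset of "standard" tags, not a full list.
STANDARD_TAGS = frozenset({
    "html", "head", "title", "body", "p", "br", "a", "b", "i", "font",
    "table", "tr", "td", "img", "div", "span", "center", "hr",
})


class SizeCounter(HTMLParser):
    """Tally characters spent on comments, unknown tags, and plain text."""

    def __init__(self, standard_tags):
        super().__init__(convert_charrefs=True)
        self.standard_tags = standard_tags
        self.comment_chars = 0
        self.nonstandard_tag_chars = 0
        self.text_chars = 0

    def handle_comment(self, data):
        # Count the comment body plus its "<!--" and "-->" delimiters.
        self.comment_chars += len(data) + 7

    def handle_starttag(self, tag, attrs):
        if tag not in self.standard_tags:
            self.nonstandard_tag_chars += len(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <foo/> are counted once, as written.
        if tag not in self.standard_tags:
            self.nonstandard_tag_chars += len(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.standard_tags:
            # Approximate the end tag as "</" + name + ">".
            self.nonstandard_tag_chars += len(tag) + 3

    def handle_data(self, data):
        self.text_chars += len(data)


def html_metrics(raw, standard_tags=STANDARD_TAGS):
    """Return comment, non-standard tag, and content fractions of raw HTML."""
    total = len(raw)
    if total == 0:
        return {"comment_fraction": 0.0,
                "nonstandard_tag_fraction": 0.0,
                "content_ratio": 0.0}
    counter = SizeCounter(standard_tags)
    counter.feed(raw)
    counter.close()
    return {
        "comment_fraction": counter.comment_chars / total,
        "nonstandard_tag_fraction": counter.nonstandard_tag_chars / total,
        "content_ratio": counter.text_chars / total,
    }


if __name__ == "__main__":
    sample = "<p>hi<!--xx--><foo>yo</foo></p>"
    for name, value in html_metrics(sample).items():
        print(f"{name}: {value:.3f}")
```

Passing a larger standard_tags set is one way to treat known proprietary
tags as acceptable without touching the counting logic.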

If worst comes to worst, one could extend these modules to recognize common
proprietary tagging to let the Microsoft dross through unscathed.

I don't know if that's really necessary though. Does SA really need a
full-blown HTML analyzer built in? I suspect that once you strip invisible
text and all HTML tagging, the resulting content will be unambiguously
spam or ham, or completely empty.

See:
http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/lib/HTML/Tree/Scanning.pod
http://search.cpan.org/dist/HTML-Clean/
http://search.cpan.org/author/SBURKE/HTML-Tree-3.17/
http://search.cpan.org/author/SBURKE/HTML-Tagset-3.03/Tagset.pm

-- Bob
[1] Insert tiresome snarky comment about HTML in email being crap enough
here.


