Bill Polhemus <[EMAIL PROTECTED]> wrote:

> They use the spurious HTML tags to break up the text and get it
> through the Bayesian filter.

I don't see any text actually broken up.  There's just not that 
much to trigger on.  The drug names (most of which aren't in 
the default rules yet) are broken up with hyphens ("IONA-MIN, 
ADI-PEX, TENU-ATE", "AM-BIEN, ZO-LOFT, VIA-GRA, TRAMA-DOL"), as 
is the word "prescription".  The hyphens do seem to be a common 
new technique, which does make finding keywords a bit 
difficult.  Maybe we can test for excessive use of hyphens?
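
A rough sketch of what such a hyphen test might look like, in Python 
rather than as an actual SpamAssassin rule.  The keyword list, the 
function names and the idea of a "hyphen ratio" are all assumptions 
made for illustration; only the drug names come from the message 
above.

```python
import re

# Hypothetical keyword list; the names are taken from the quoted spam.
KEYWORDS = {
    "ionamin", "adipex", "tenuate", "ambien",
    "zoloft", "viagra", "tramadol", "prescription",
}

# A word split by single hyphens, e.g. "VIA-GRA" or "pre-scrip-tion".
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")


def hyphen_obfuscated_keywords(text):
    """Return the keywords that occur in hyphen-split form in text."""
    hits = []
    for match in HYPHENATED.finditer(text):
        # Drop the hyphens and compare case-insensitively.
        joined = match.group(0).replace("-", "").lower()
        if joined in KEYWORDS:
            hits.append(joined)
    return hits


def hyphen_ratio(text):
    """Fraction of whitespace-separated tokens containing an inner hyphen."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hyphenated = sum(1 for token in tokens if HYPHENATED.search(token))
    return hyphenated / len(tokens)
```

Stripping the hyphens and re-checking the keyword list is probably 
the safer of the two checks, since a plain ratio would also count 
ordinary hyphenated words such as "well-known".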

One fairly easily detectable spam sign is the almost-white text 
(used to hide the irrelevant words), like this:

> <font face="Arial"><font color="#FFFFF2">argumentation scabby
> writhe</font>

> <font color="#FFFFF2"><br> dent unerring attract</font>

That should have triggered HTML_FONT_INVISIBLE, but I think 
that test has some bugs.
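
For illustration, a near-white check could be sketched as below.  
This is not how HTML_FONT_INVISIBLE is implemented; the function 
name, the regular expression and the 0xF0 threshold are assumptions, 
and the sketch further assumes a white background and six-digit hex 
colors only.

```python
import re

# Matches <font ... color="#RRGGBB"> and captures the six hex digits.
FONT_COLOR = re.compile(
    r'<font[^>]*\bcolor\s*=\s*["\']?#([0-9a-fA-F]{6})',
    re.IGNORECASE,
)


def near_white_colors(markup, threshold=0xF0):
    """Return font colors whose R, G and B components are all >= threshold."""
    found = []
    for match in FONT_COLOR.finditer(markup):
        value = match.group(1)
        # Split "RRGGBB" into three integer components.
        r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
        if min(r, g, b) >= threshold:
            found.append("#" + value.upper())
    return found
```

With the threshold assumed here, "#FFFFF2" from the sample above 
counts as near-white, while an ordinary dark text color does not.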

Another thing that we should be checking for is stuff like 
this:

> <A href="http://&#101;&#119;taj&#115;&#108;&#97;&#110;d&#46;&#98;&#105;&#122;/&#114;&#109;&#112;&#54;6&#53;&#49;/">Visit_to_begin_your_order</A>

There's a test for something similar, SPAM_FORM_ACTION, but it 
needs to be expanded to test HREFs as well, and to catch URLs 
that are only partially HTML-escaped.
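
As a sketch of the kind of HREF check meant here (again in Python, 
not the actual SPAM_FORM_ACTION code), one could count the numeric 
character references inside each href value and decode them.  The 
function name and regular expressions are assumptions; 
html.unescape is the standard-library decoder.

```python
import html
import re

# Quoted href attribute values.
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
# Decimal (&#101;) or hexadecimal (&#x65;) character references.
NUMERIC_ENTITY = re.compile(r"&#(?:\d+|[xX][0-9a-fA-F]+);")


def escaped_hrefs(markup):
    """Return (decoded_url, entity_count) for hrefs containing numeric entities."""
    results = []
    for match in HREF.finditer(markup):
        raw = match.group(1)
        count = len(NUMERIC_ENTITY.findall(raw))
        if count:
            # Partially escaped URLs are caught too, since any entity counts.
            results.append((html.unescape(raw), count))
    return results


sample = (
    '<A href="http://&#101;&#119;taj&#115;&#108;&#97;&#110;d&#46;&#98;'
    '&#105;&#122;/&#114;&#109;&#112;&#54;6&#53;&#49;/">'
    'Visit_to_begin_your_order</A>'
)
print(escaped_hrefs(sample))
```

Because the count is taken per href, a URL with only a few escaped 
characters is flagged the same way as a fully escaped one; how many 
entities should be tolerated before scoring is left open.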


-- 
Keith C. Ivey <[EMAIL PROTECTED]>
Washington, DC



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
