Bill Polhemus <[EMAIL PROTECTED]> wrote:

> They use the
> spurious HTML tags to break up the text and get it through the
> Bayesian filter.

I don't see any text actually broken up. There's just not that much
to trigger on. The drug names (most of which aren't in the default
rules yet) are broken up with hyphens ("IONA-MIN, ADI-PEX, TENU-ATE",
"AM-BIEN, ZO-LOFT, VIA-GRA, TRAMA-DOL"), as is the word
"prescription". The hyphens seem to be a common new technique, and
they do make finding keywords a bit difficult. Maybe we can test for
excessive use of hyphens?

One fairly easily detectable spam sign is the almost-white text (used
to hide the irrelevant words), like this:

> <font face="Arial"><font color="#FFFFF2">argumentation scabby
> writhe</font>
><font color="#FFFFF2"><br> dent unerring attract</font>

That should have triggered HTML_FONT_INVISIBLE, but I think that test
has some bugs.

Another thing that we should be checking for is stuff like this:

> <A href="http://ewtajsland.b&#
> 105;z/rmp6651/">Visit_to_begin_your_order</A>

There's a test for something similar, SPAM_FORM_ACTION, but it needs
to be expanded to test for HREFs as well, and for URLs that are only
partially HTML-escaped.

-- 
Keith C. Ivey <[EMAIL PROTECTED]>
Washington, DC

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
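
For illustration only, here is a rough Python sketch of the three checks suggested in the message: keywords split by hyphens, near-white font colors, and URLs with HTML-escaped letters. This is not SpamAssassin code and does not reflect how HTML_FONT_INVISIBLE or SPAM_FORM_ACTION are implemented. The function names, the keyword set, and the 0xF0 threshold are all hypothetical, the color check assumes a white background, and the sample URL assumes the line break inside the quoted HREF is only a quoting artifact.

```python
import re

# Words made of letter runs joined by hyphens, e.g. "VIA-GRA".
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")
# Numeric character references, decimal (&#105;) or hex (&#x69;).
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def hyphen_obscured_keywords(text, keywords):
    """Return hyphenated tokens that match a keyword once hyphens are removed.

    `keywords` is a hypothetical set of lowercase words to look for.
    """
    hits = []
    for match in HYPHENATED.finditer(text):
        joined = match.group(0).replace("-", "").lower()
        if joined in keywords:
            hits.append(match.group(0))
    return hits


def is_near_white(color, threshold=0xF0):
    """True if every RGB channel of a #RRGGBB color is >= threshold.

    Assumes a white background; the threshold is an arbitrary choice.
    """
    m = re.fullmatch(r"#([0-9A-Fa-f]{6})", color.strip())
    if not m:
        return False
    value = m.group(1)
    channels = [int(value[i:i + 2], 16) for i in (0, 2, 4)]
    return all(c >= threshold for c in channels)


def has_escaped_alnum(url):
    """True if the URL hides an ASCII letter or digit behind a numeric reference."""
    for m in NUMERIC_REF.finditer(url):
        code = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        if code < 128 and chr(code).isalnum():
            return True
    return False


if __name__ == "__main__":
    drugs = {"ambien", "zoloft", "viagra", "tramadol"}
    print(hyphen_obscured_keywords("AM-BIEN, ZO-LOFT, VIA-GRA, TRAMA-DOL", drugs))
    print(is_near_white("#FFFFF2"))
    print(has_escaped_alnum("http://ewtajsland.b&#105;z/rmp6651/"))
```

Ordinary hyphenated words such as "almost-white" are ignored unless their joined form is in the keyword set, so a count of these hits could serve as one version of the "excessive use of hyphens" idea. The escape check looks only at numeric references that decode to plain letters or digits, which leaves a legitimate `&amp;` in a query string alone.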