On 1/30/2014 12:39 PM, Amir Caspi wrote:
On Jan 30, 2014, at 10:28 AM, Kevin A. McGrail <kmcgr...@pccc.com> wrote:
If you want to share the complete rule, I can throw it into my
sandbox and see what masscheck thinks as well.
The complete rule would be something like this, assuming Andy
implemented it as I wrote it:
rawbody HTML_NONSENSE_TAGS /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
describe HTML_NONSENSE_TAGS Many consecutive multi-letter HTML tags, likely nonsense/spam
score HTML_NONSENSE_TAGS 0.001
Score to be adjusted as needed, of course.
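For anyone who wants to sanity-check the pattern before masscheck results come in, here is a minimal sketch in Python that exercises the same regular expression on hand-made strings. The sample strings and the helper name has_nonsense_tags are made up for illustration, and this only tests the regex itself; it does not reproduce how SpamAssassin decodes and splits the body for rawbody rules.

```python
import re

# Same pattern as the proposed rule: ten or more consecutive tags,
# each consisting only of 4+ alphanumerics between angle brackets.
NONSENSE_TAGS = re.compile(r"(?:<[A-Za-z0-9]{4,}>\s*){10,}")


def has_nonsense_tags(html: str) -> bool:
    """Hypothetical helper: True if the pattern matches anywhere in html."""
    return NONSENSE_TAGS.search(html) is not None


# Invented sample data, not taken from any real message.
words = ["qwer", "zxcv", "asdf", "hjkl", "uiop",
         "tyui", "bnmq", "wert", "sdfg", "xcvb"]
spammy = " ".join(f"<{w}>" for w in words)         # ten gibberish tags
nine_tags = " ".join(f"<{w}>" for w in words[:9])  # one short of the threshold
normal = ("<html><head><title>Hi</title></head>"
          "<body><p>Hello</p></body></html>")

for label, sample in [("spammy", spammy),
                      ("nine_tags", nine_tags),
                      ("normal", normal)]:
    print(label, has_nonsense_tags(sample))
```

Closing tags, tags with attributes, and short tags such as <p> do not match the character class, so ordinary markup rarely produces ten qualifying tags in a row.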
If one wants to be even more explicit, one could require that the tags
be prefaced with a <style> tag, although that should, hopefully, get
picked up by John Hardin's modifications to STYLE_GIBBERISH sometime
in the near future.
Cheers.
--- Amir
Added to the sandbox. In a day or three, we should be able to check the
ruleQA and see what it looks like on the masscheck corpora.
svn commit -m 'Adding html tag gibberish tag rule for testing from
Amir Caspi on the mailing list'
Adding rulesrc/sandbox/kmcgrail/20_mailing_list.cf
Transmitting file data .
Committed revision 1562916.
The rule is called AC_HTML_NONSENSE_TAGS, and you can look it up on
http://ruleqa.spamassassin.org/
The S/O ratio is the main thing to look at: http://wiki.apache.org/spamassassin/S/O
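As a rough illustration of what that number means, here is a small Python sketch. It assumes S/O is simply spam hits divided by total hits (spam plus ham); ruleqa may derive it from per-corpus hit percentages instead, so treat this as an approximation. The function name and the example counts are invented.

```python
def so_ratio(spam_hits: int, ham_hits: int) -> float:
    """Assumed definition: fraction of a rule's hits that landed on spam."""
    total = spam_hits + ham_hits
    if total == 0:
        raise ValueError("rule has no hits, S/O is undefined")
    return spam_hits / total


# Invented counts: a rule hitting 990 spam and 10 ham messages.
print(so_ratio(990, 10))
```

Under this definition, a value near 1.0 means the rule hits almost only spam, and a value near 0.0 means it hits almost only ham.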
Regards,
KAM