Hi list!
I just wrote a nice .cf that disables SA's old ten-step HTML-percentage tests and runs a hundred tests instead, ranging from 0% to 100% (naturally *g*). I hope you find it useful; comments appreciated!
http://www.poppe-online.de/spamassassin/55_html_perc_tests.cf
As a second comment, why do the scores you assigned increase linearly with the HTML percentage? Is there a reason, or is it just something you whipped up quickly?
The scores assigned by the GA, and the S/O values in STATISTICS.txt, indicate that an increasing HTML percentage is not a good indicator of an increasing chance of spam.
As a very crude statistical measure, take a look at the S/O values in 2.6x's STATISTICS.txt. They are mostly increasing, but note the slight dip when you get to HTML_80_90.
OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
   1.327   0.9723   1.9908  0.328   0.03   0.00  HTML_00_10
   1.934   2.1962   1.4436  0.603   0.21   0.00  HTML_10_20
   3.458   4.6458   1.2322  0.790   0.48   0.69  HTML_20_30
   4.132   5.8439   0.9245  0.863   0.62   0.84  HTML_30_40
   6.525   9.4988   0.9541  0.909   0.73   0.87  HTML_40_50
  11.599  17.0960   1.3041  0.929   0.79   0.70  HTML_50_60
  13.701  20.3955   1.1619  0.946   0.83   0.36  HTML_60_70
  10.555  15.8318   0.6713  0.959   0.86   0.38  HTML_70_80
   5.810   8.6467   0.4974  0.946   0.82   0.01  HTML_80_90
   0.940   1.4251   0.0317  0.978   0.89   0.31  HTML_90_100
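For anyone who wants to check the dip without eyeballing the column, here is a rough Python sketch. It assumes S/O is SPAM% / (SPAM% + HAM%); that formula is inferred from the numbers above, which it reproduces to three decimals, and is not taken from the SpamAssassin source. The helper names are made up for illustration.

```python
# (name, SPAM%, HAM%, published S/O) copied from the STATISTICS.txt excerpt above.
ROWS = [
    ("HTML_00_10", 0.9723, 1.9908, 0.328),
    ("HTML_10_20", 2.1962, 1.4436, 0.603),
    ("HTML_20_30", 4.6458, 1.2322, 0.790),
    ("HTML_30_40", 5.8439, 0.9245, 0.863),
    ("HTML_40_50", 9.4988, 0.9541, 0.909),
    ("HTML_50_60", 17.0960, 1.3041, 0.929),
    ("HTML_60_70", 20.3955, 1.1619, 0.946),
    ("HTML_70_80", 15.8318, 0.6713, 0.959),
    ("HTML_80_90", 8.6467, 0.4974, 0.946),
    ("HTML_90_100", 1.4251, 0.0317, 0.978),
]


def spam_over_overall(spam_pct, ham_pct):
    """Assumed S/O definition: share of a rule's hits that land on spam."""
    return spam_pct / (spam_pct + ham_pct)


def find_dips(values):
    """Return the indices where a value is lower than the one before it."""
    return [i for i in range(1, len(values)) if values[i] < values[i - 1]]


computed = [spam_over_overall(spam, ham) for _, spam, ham, _ in ROWS]
dips = [ROWS[i][0] for i in find_dips(computed)]

for (name, _, _, published), value in zip(ROWS, computed):
    marker = "  <-- dip" if name in dips else ""
    print(f"{name:<12} computed={value:.3f} published={published:.3f}{marker}")
```

The same two helpers would work on mass-check results for a hundred one-percent buckets, which is where a non-monotonic curve would really show up.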
It'd be interesting to run your tests against a good-sized corpus with mass-check, if for no other reason than to see what the S/O curve looks like.