> On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
>> You definitely have a good point that it would only be necessary to
>> track the combinations that actually show up in emails, however
>> 1024 is only the possible combinations from one set of 10 rules.
>> The number of combinations in the actual corpora would be much
>> higher. I'll try to get you a number.
On 10/10/2011 06:55 AM, Marc Perkel wrote:
> You wouldn't have to store all combinations. You could just do up to
> 3 levels and only the combinations that actually occur and use a hash
> to look up the combinations.

The data is all there if you have access to the spam.log and ham.log
files created by mass-check (warning: this code was composed in email,
not vim, and it has not been run):

#############################
#!/bin/sh
# Give three rules as arguments. Assumes ham.log and spam.log in PWD.

# Messages that hit all three rules, per corpus
tp=$(grep -w "$1" spam.log | grep -w "$2" | grep -wc "$3")
fp=$(grep -w "$1" ham.log  | grep -w "$2" | grep -wc "$3")

# Total messages per corpus (non-comment lines)
spams=$(grep -c '^[^#]' spam.log)
hams=$(grep -c '^[^#]' ham.log)

tpr=$(echo "scale=5; $tp * 100 / $spams" | bc)
fpr=$(echo "scale=5; $fp * 100 / $hams" | bc)

# S/O is undefined when the combination never occurs in either corpus
if [ "$tp" -eq 0 ] && [ "$fp" -eq 0 ]; then
    so="n/a"
else
    so=$(echo "scale=4; $tpr / ($tpr + $fpr)" | bc)
fi

echo "meta rule $1 && $2 && $3"
echo "  SPAM% $tpr  HAM% $fpr  S/O $so"
#############################

Now you can pick your thresholds for moving forward (and your
thresholds for saving a combination as a no-go in the future). These
numbers are just as valid as anything you'd get through the actual
mass-check run.

Still, I worry about what this does to the GA.

PS: As an SA Committer, do I have access to those logs?
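
The same logs can also answer the "only the combinations that actually
occur" question directly. Below is a minimal sketch, not something that
ships with mass-check: count_pairs is a hypothetical helper, and it
assumes the rule names sit in a comma-separated list in the 4th
whitespace-separated field of each log line (change "field" if your
logs differ). The awk array plays the role of the hash, so only pairs
that really co-occur are ever stored.

```shell
#!/bin/sh
# Sketch: count every rule pair that actually co-occurs in a log file.
# ASSUMPTION: rule names are comma-separated in the 4th field.
count_pairs() {
    awk -v field=4 '
        /^#/ || NF < field { next }   # skip comments and short lines
        {
            n = split($field, r, ",")
            for (i = 1; i <= n; i++)
                for (j = i + 1; j <= n; j++) {
                    # Order each pair so "A B" and "B A" share one key
                    key = (r[i] < r[j]) ? r[i] " " r[j] : r[j] " " r[i]
                    seen[key]++
                }
        }
        END { for (k in seen) print seen[k], k }
    ' "$1" | LC_ALL=C sort -k1,1nr -k2
}
```

Run it as "count_pairs spam.log" and again on ham.log to compare the
two lists. A third nested loop would extend it to triples, though the
number of keys grows quickly with the number of rules hit per message.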