> On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
>> You definitely have a good point that it would only be necessary to
>> track the combinations that actually show up in emails, however
>> 1024 is only the possible combinations from one set of 10 rules.
>> The number of combinations in the actual corpora would be much
>> higher.  I'll try to get you a number.

On 10/10/2011 06:55 AM, Marc Perkel wrote:
> You wouldn't have to store all combinations. You could just do up to
> 3 levels and only the combinations that actually occur and use a hash
> to look up the combinations.

The data is all there if you have access to the spam.log and ham.log
files created by mass-check (warning, this code was composed in email,
not vim, and it has not been run):

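#############################
#!/bin/sh
# Usage: $0 RULE1 RULE2 RULE3
# Assumes ham.log and spam.log (from mass-check) are in $PWD.

[ $# -eq 3 ] || { echo "usage: $0 RULE1 RULE2 RULE3" >&2; exit 1; }

# Count lines in log $1 that hit all three rules (-F literal, -w whole word).
hits() { grep -Fw -e "$2" "$1" | grep -Fw -e "$3" | grep -Fwc -e "$4"; }

tp=$(hits spam.log "$@")
fp=$(hits ham.log "$@")

# Total messages: every non-empty line that is not a comment.
spams=$(grep -c '^[^#]' spam.log)
hams=$(grep -c '^[^#]' ham.log)
[ "$spams" -gt 0 ] && [ "$hams" -gt 0 ] || { echo "empty log" >&2; exit 1; }

tpr=$(echo "scale=5; $tp * 100 / $spams" | bc)
fpr=$(echo "scale=5; $fp * 100 / $hams" | bc)

# S/O is undefined when the combination never fires; avoid dividing by zero.
if [ "$tp" -eq 0 ] && [ "$fp" -eq 0 ]; then
    so="n/a"
else
    so=$(echo "scale=4; $tpr / ($tpr + $fpr)" | bc)
fi

echo "meta rule  $1 && $2 && $3"
echo "  SPAM% $tpr   HAM% $fpr   S/O $so"
#############################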

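Now you can pick your thresholds for moving forward (and your thresholds
for saving a combination as a no-go in the future).  Since a meta rule
that ANDs three rules fires exactly when all three of them hit, and the
logs already record every hit per message, these numbers are just as
valid as anything you'd get through the actual mass-check run.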
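
To find which triples are worth feeding to a check like that, Marc's
"only the combinations that actually occur, in a hash" idea might look
roughly like the sketch below.  It is not part of mass-check.  The
function name count_triples is made up, and it assumes the fourth
whitespace-separated field of each non-comment log line is the
comma-separated list of rules hit, so adjust it if your logs differ.
A message that hits n rules contributes n*(n-1)*(n-2)/6 triples, so
this can get slow and memory-hungry on a large corpus.

```sh
#!/bin/sh
# Sketch: count every rule triple that actually occurs in a mass-check log.
# ASSUMPTION: field 4 of each non-comment line is "RULE_A,RULE_B,...".
# count_triples is a hypothetical helper name, not an SA tool.

count_triples() {
    awk '
    /^#/ || NF < 4 { next }          # skip comments and short lines
    {
        n = split($4, r, ",")
        # Insertion sort, so each triple has a single canonical key
        # no matter what order the rules were logged in.
        for (i = 2; i <= n; i++) {
            v = r[i]
            for (j = i - 1; j >= 1 && (r[j] "") > (v ""); j--)
                r[j + 1] = r[j]
            r[j + 1] = v
        }
        # Every 3-combination of this message goes into the hash.
        for (a = 1; a <= n - 2; a++)
            for (b = a + 1; b <= n - 1; b++)
                for (c = b + 1; c <= n; c++)
                    seen[r[a] " " r[b] " " r[c]]++
    }
    # One line per triple seen: "count RULE1 RULE2 RULE3", unordered.
    END { for (k in seen) print seen[k], k }
    ' "$1"
}
```

Something like "count_triples spam.log | sort -rn | head" would then
list the most common triples in the spam corpus, and the same call on
ham.log gives the counts for the false-positive side.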

Still, I worry about what this does to the GA.


PS:  As an SA Committer, do I have access to those logs?
