Bart Schaefer wrote:
> On 4/29/06, Matt Kettler <[EMAIL PROTECTED]> wrote:
>> Besides.. if you want to make a mathematics-based argument against me,
>> start by explaining how the perceptron is mathematically flawed. It
>> assigned the original score based on real-world data.
>
> Did it?  I thought the BAYES_* scores have been fixed values for a
> while now, to force the perceptron to adapt the other scores to fit.
>
Actually, you're right. I'm shocked and floored, but you're right.

In SA 3.1.0 they did force-fix the scores of the Bayes rules,
particularly at the high end. The perceptron assigned BAYES_99 a score
of 1.89 in the 3.1.0 mass-check run; the devs jacked it up to 3.50.
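
For anyone following along, the perceptron here is just a single-layer
score optimizer: a message's total score is the sum of the scores of
the rules it hits, it's called spam when the total crosses the 5.0
threshold, and after each misclassification the scores of the hit
rules get nudged, except the force-fixed ones. Here's a minimal Python
sketch; the names and the update rule are my simplification, not SA's
actual perceptron code:

THRESHOLD = 5.0       # SpamAssassin's spam cutoff
LEARNING_RATE = 0.01  # arbitrary step size for this sketch

def train(messages, scores, frozen, epochs=100):
    # messages: list of (set_of_rule_names_hit, is_spam) pairs
    # scores:   dict of rule name -> score, mutated in place
    # frozen:   set of rule names whose scores are force-fixed
    for _ in range(epochs):
        for hits, is_spam in messages:
            total = sum(scores[r] for r in hits)
            if (total >= THRESHOLD) == is_spam:
                continue  # classified correctly; no update
            # missed spam: push mutable hit rules up;
            # false positive: pull them down
            delta = LEARNING_RATE if is_spam else -LEARNING_RATE
            for r in hits:
                if r not in frozen:
                    scores[r] += delta
    return scores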

That does make me wonder if:
    1) When BAYES_9x FPs, it FPs in conjunction with lots of other rules
because the ham corpus is polluted with spam, forcing the perceptron to
try to compensate. (Pollution is always a problem, since nobody is
perfect, but it occurs to differing degrees.) There's a toy
demonstration of this effect after the list.
   -or-
    2) The perceptron is out of whack. (I highly doubt this, because the
perceptron generated the scores for 3.0.x and those were fine.)
   -or-
    3) The real-world FPs of BAYES_99 really do tend to cascade with
other rules in the 3.1.x ruleset, and the perceptron is correctly
capping the score. This could differ from 3.0.x due to changes in the
rules, or changes in ham patterns over time.
   -or-
    4) One of the corpus submitters has a poorly trained Bayes db.
(Possible, but I doubt it.)
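
To make hypothesis 1 (and the capping in hypothesis 3) concrete, here's
a toy run of the train() sketch above. The rule names and counts are
made up; the only point is the direction of the effect:

scores = {"BAYES_99": 3.50, "RULE_A": 2.0, "RULE_B": 2.0}  # hypothetical
frozen = {"BAYES_99"}  # force-fixed, as in 3.1.0
messages = [
    ({"BAYES_99", "RULE_A", "RULE_B"}, False),  # spam mislabeled as ham
    ({"BAYES_99", "RULE_A", "RULE_B"}, True),   # genuine spam, same hits
] * 50
print(train(messages, scores, frozen))
# RULE_A and RULE_B get dragged down until the total hovers near the
# 5.0 threshold; BAYES_99 can't move, so the other rules absorb the blame.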

Looking at statistics-set3 for 3.0.x and 3.1.x, there was a slight
increase in ham hits for BAYES_99 and a slight decrease in spam hits:
3.0.x:
OVERALL%   SPAM%     HAM%     S/O     RANK   SCORE   NAME
43.515     89.3888   0.0335   1.000   0.83   1.89    BAYES_99

3.1.x:
OVERALL%   SPAM%     HAM%     S/O     RANK   SCORE   NAME
60.712     86.7351   0.0396   1.000   0.90   3.50    BAYES_99

Also worth considering: set3 for 3.0.x was much closer to a 50/50 mix
of spam/nonspam (48.7/51.3) than 3.1.0's was (nearly 70/30).
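
Side note: the S/O of 1.000 on both lines is just rounding. With ham
hits that low, spam/(spam+ham) works out to roughly 0.9996. A
back-of-the-envelope check in Python, using the 48.7/51.3 mix above as
assumed corpus sizes rather than the real set3 message counts:

spam_msgs, ham_msgs = 48_700, 51_300  # assumed from the 48.7/51.3 mix
spam_hits = 0.893888 * spam_msgs      # SPAM% = 89.3888
ham_hits  = 0.000335 * ham_msgs       # HAM%  =  0.0335
print(f"{spam_hits / (spam_hits + ham_hits):.4f}")  # -> 0.9996

which the statistics file displays as 1.000.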
