On 10/10, Marc Perkel wrote:
> On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
> >On 09/28, Marc Perkel wrote:
> >>You would only have to test the rule combinations that the message
> >>actually triggered. So if it hit 10 rules then it would be 1024
> >>combinations. Seems not to be unreasonable to me.
> >You definitely have a good point that it would only be necessary to track
> >the combinations that actually show up in emails, however 1024 is only
> >the possible combinations from one set of 10 rules.  The number of
> >combinations in the actual corpora would be much higher.  I'll try to
> >get you a number.
> 
> You wouldn't have to store all combinations. You could just do up to
> 3 levels and only the combinations that actually occur and use a
> hash to look up the combinations.

I never said storage would be a problem.  I agree you could store only the
relatively small number of combinations that turn out to be most useful.
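
To put rough numbers on the per-message case, here is a minimal sketch
of the arithmetic.  The helper name subsets_up_to is made up for
illustration, and it only counts combinations for a single message; it
says nothing about how many distinct combinations show up across the
corpora.

```python
from math import comb


def subsets_up_to(n_rules, max_size):
    """Count non-empty rule combinations containing at most max_size rules."""
    return sum(comb(n_rules, k) for k in range(1, max_size + 1))


# Every subset of the 10 rules one message hit, including the empty set.
all_subsets = 2 ** 10

# Only singles, pairs and triples ("up to 3 levels"): 10 + 45 + 120.
limited = subsets_up_to(10, 3)

print(all_subsets, limited)
```

For a message that hits 10 rules, that is 1024 subsets in total versus
175 combinations of one to three rules.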

The problems are:
1) The many years it would take to find useful rule combinations by trying
   one candidate combination per masscheck run.
2) The hundreds of times more masscheck data we'd need to get an accurate
   re-score covering every rule combination that exists in the corpora.

It would still be possible to analyze which combinations of rules hit
false negatives significantly more often than they hit non-spam (or
false positives significantly more often than they hit spam).
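
A rough sketch of what that kind of analysis could look like, assuming
each message is available as a plain list of the rule names it hit.
The function names, the thresholds and the add-one smoothing are all
assumptions made for illustration, not anything masscheck provides, and
a simple rate ratio is only a stand-in for a proper significance test.

```python
from collections import Counter
from itertools import combinations


def combo_counts(messages, max_size=3):
    """Count, per combination of 2..max_size rules, how many messages hit it."""
    counts = Counter()
    for rules in messages:
        hit = sorted(set(rules))  # canonical order so each combo has one key
        for k in range(2, max_size + 1):
            counts.update(combinations(hit, k))
    return counts


def skewed_combos(false_negs, ham, min_hits=2, min_ratio=3.0):
    """Return combos hit by false negatives far more often than by ham.

    Each result is (combo, false_negative_hits, ham_hits, rate_ratio),
    sorted with the most skewed combination first.
    """
    fn_counts = combo_counts(false_negs)
    ham_counts = combo_counts(ham)
    results = []
    for combo, n in fn_counts.items():
        if n < min_hits:
            continue  # too rare to say anything about
        fn_rate = n / len(false_negs)
        # Add-one smoothing (an assumption) avoids dividing by zero when
        # a combination never appears in ham.
        ham_rate = (ham_counts[combo] + 1) / (len(ham) + 1)
        ratio = fn_rate / ham_rate
        if ratio >= min_ratio:
            results.append((combo, n, ham_counts[combo], ratio))
    return sorted(results, key=lambda r: r[3], reverse=True)
```

Swapping the two inputs (false positives in place of false negatives,
spam in place of ham) gives the other direction of the comparison.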

-- 
Immorality: "The morality of those who are having a better time"
- Henry Louis Mencken
http://www.ChaosReigns.com
