On 10/10/2011 9:16 AM, dar...@chaosreigns.com wrote:
On 10/10, Marc Perkel wrote:
On 9/28/2011 8:02 AM, dar...@chaosreigns.com wrote:
On 09/28, Marc Perkel wrote:
You would only have to test the rule combinations that the message
actually triggered. So if it hit 10 rules, that would be 2^10 = 1024
combinations. That doesn't seem unreasonable to me.
You definitely have a good point that it would only be necessary to track
the combinations that actually show up in emails. However, 1024 is only
the number of possible combinations from one set of 10 rules. The number
of combinations in the actual corpora would be much higher. I'll try to
get you a number.
You wouldn't have to store all combinations. You could go up to 3
levels deep, keep only the combinations that actually occur, and use a
hash to look them up.
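
A minimal sketch of that idea in Python, assuming "3 levels" means
combinations of at most three rules and that each message is available as
a list of the rule names it hit. The function names and the sample
messages are made up for illustration; this is not how masscheck stores
its results.

```python
from collections import Counter
from itertools import combinations

MAX_LEVEL = 3  # assumption: "3 levels" = combinations of at most 3 rules


def observed_combinations(rules_hit, max_level=MAX_LEVEL):
    """Yield every combination of 1..max_level rules that one message hit."""
    # Sort so that each combination maps to exactly one hash key.
    ordered = sorted(set(rules_hit))
    for size in range(1, max_level + 1):
        yield from combinations(ordered, size)


def count_combinations(messages, max_level=MAX_LEVEL):
    """Count only the combinations that actually occur in the messages."""
    counts = Counter()  # a hash table keyed by tuples of rule names
    for rules_hit in messages:
        counts.update(observed_combinations(rules_hit, max_level))
    return counts


# Illustrative sample data, not taken from a real corpus.
messages = [
    ["HTML_MESSAGE", "URIBL_BLACK", "MISSING_DATE"],
    ["HTML_MESSAGE", "URIBL_BLACK"],
]
counts = count_combinations(messages)
print(counts[("HTML_MESSAGE", "URIBL_BLACK")])
```

With the cap at three rules, a message that hits 10 rules contributes
10 + 45 + 120 = 175 keys rather than all 1024 subsets.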
I never said storage would be a problem. I agree you could just store a
relatively small number that were most useful.
The problems are:
1) The many years it would take to find useful rule combinations by trying
one possibility per masscheck run.
2) The hundreds of times more masscheck data we'd need to get an
accurate re-score using all rule combinations existing in the corpora.
There is still the possibility of analyzing which combinations of rules
hit false negatives significantly more often than they hit non-spam
(or false positives vs. spam).
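
A rough sketch of that kind of analysis in Python, assuming the false
negatives and the non-spam are each available as lists of the rule names
every message hit. The function names, thresholds, and rule names are
hypothetical, and the plain rate ratio used here is only a stand-in for a
real significance test.

```python
from collections import Counter
from itertools import combinations


def combo_counts(messages, max_size=2):
    """Count how many messages hit each combination of 2..max_size rules."""
    counts = Counter()
    for rules_hit in messages:
        ordered = sorted(set(rules_hit))  # one canonical key per combination
        for size in range(2, max_size + 1):
            counts.update(combinations(ordered, size))
    return counts


def candidate_combos(false_negatives, ham, max_size=2, min_hits=2, min_ratio=5.0):
    """Return combinations hit far more often by false negatives than by ham."""
    fn_counts = combo_counts(false_negatives, max_size)
    ham_counts = combo_counts(ham, max_size)
    results = []
    for combo, fn_hits in fn_counts.items():
        if fn_hits < min_hits:
            continue  # too few hits to say anything
        fn_rate = fn_hits / len(false_negatives)
        # Add-one smoothing so a combination never seen in ham
        # does not cause a division by zero.
        ham_rate = (ham_counts[combo] + 1) / (len(ham) + 1)
        ratio = fn_rate / ham_rate
        if ratio >= min_ratio:
            results.append((combo, fn_hits, ham_counts[combo], ratio))
    results.sort(key=lambda r: r[3], reverse=True)
    return results


# Hypothetical sample data.
false_negatives = [
    ["RULE_A", "RULE_B"],
    ["RULE_A", "RULE_B", "RULE_C"],
    ["RULE_A", "RULE_C"],
]
ham = [["RULE_A"], ["RULE_B"], ["RULE_C"], ["RULE_A", "RULE_C"]]

found = candidate_combos(false_negatives, ham, min_ratio=3.0)
for combo, fn_hits, ham_hits, ratio in found:
    print(combo, fn_hits, ham_hits, round(ratio, 2))
```

Swapping the two corpora (false positives in place of false negatives,
spam in place of ham) gives the other direction. This only ranks
candidates; it does not address how much masscheck data an accurate
re-score would need.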
I suppose it seems to me that there should be some automated way to find
useful rule combinations.
--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400