On Thu, May 30, 2002 at 10:01:27AM +0100, Matt Sergeant wrote:
> Kingsley G. Morse Jr. wrote:
> >Good point. Combinations of some rules may be more
> >indicative of spam than others.
> >
> >It would be great if the GA could infer the boolean
> >logic, as well as the scores.
>
> It's possible that you could group the rules that matched, and feed it
> into the score generating system (whatever that may be - I'm looking to
> get rid of using the GA's here as it's just too slow to work with).
>
> You'd have to do some spanning though. For example, if an email matches
> rules A B C D and E, and you decided you wanted to try scoring against
> triplets, you'd need to feed the score generator:
>
> ABC
> ACD
> ADE
> ABD
> ABE
> ACE
> BCD
> BDE
> BCE
> CDE
>
> (I may have missed some combinations above, but you get the idea).
>
> So yes, I think it can be done (and pretty easily with my new
> system[1]), but it's a fair bit of work.
>
> Matt.
>
> [1] Unfortunately it's not something I can give away - not yet. Maybe
> towards the end of Q3 after we've got all this running live.
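The "spanning" Matt describes can be sketched in a few lines of Python. This is only an illustration of enumerating the groups, not his (unreleased) system; the function name `rule_groups` and the single-letter rule names are made up for the example.

```python
from itertools import combinations


def rule_groups(matched_rules, size):
    """Return every unordered group of `size` rules drawn from the rules
    that matched one message (hypothetical helper, for illustration only)."""
    # Sort first so the same set of rules always yields the same group names.
    return ["".join(group) for group in combinations(sorted(matched_rules), size)]


# The example from the quoted message: an email matching rules A, B, C, D and E.
triplets = rule_groups(["A", "B", "C", "D", "E"], 3)
for name in triplets:
    print(name)
```

For five matched rules this gives 5-choose-3 = 10 triplets, so the ten listed in the quoted message are in fact the complete set.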
Clearly, we cannot do this with EVERY combination, unless Craig has a lot of CPU to spare. There are just under 400 rules right now. If we ended up with 400 tests, there would be 79,800 doubles and 10,586,800 triplets, or 10,667,000 scores in total once the 400 single rules are counted. So, even assuming the GA runs in O(n) time (which is not at all likely to be true -- I'd guess O(n^2) if I had to), generating scores would take about 26,668 times longer than it does now.

Of course, the total would be smaller if doubles and triplets were only added as they were actually seen, but it would still be quite significant, and I estimate it would be extremely taxing on CPU.

--
Duncan Findlay
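For anyone who wants to check the arithmetic, here is a rough sketch in Python. It assumes exactly 400 rules and that every double and triplet gets its own score; `score_counts` is a name invented for this example, and the O(n) and O(n^2) cost models are the guesses from the message above, not measurements of the GA.

```python
from math import comb  # available in Python 3.8 and later


def score_counts(n_rules, max_size=3):
    """Number of scores to evolve for each group size, assuming every
    combination of up to `max_size` rules is scored (hypothetical helper)."""
    return {size: comb(n_rules, size) for size in range(1, max_size + 1)}


n_rules = 400
counts = score_counts(n_rules)
total = sum(counts.values())

# Slowdown relative to scoring the single rules only.
slowdown_linear = total / n_rules          # if the GA were O(n)
slowdown_quadratic = slowdown_linear ** 2  # if the GA were O(n^2)

print(counts)
print(total, slowdown_linear, slowdown_quadratic)
```

Under the linear assumption the factor is 26,667.5, which rounds to the figure quoted above; under the quadratic guess it is on the order of 10^8.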