[EMAIL PROTECTED] (Justin Mason) writes:

>> I still think my submissions of spam that isn't flagged by the system is of
>> particular usefulness, am I deluding myself? It seems like FP's or FN's
>> would always be the most useful additions.
>
> Well, too many and they'll overbalance the corpus [1] so it no longer
> simulates a real mail load. Lots of "generic" spam and nonspam is
> important, too, so rule hit-rates are measured accurately.
I said before that I understood why mismarked messages in the respective corpuses (corpuscles? :-)) were a bad thing, but I don't understand this twist. I played around with GAs a decade or so ago when they first gained mainstream memeshare, but my knowledge is weak enough that your statement is counterintuitive. As long as I have a representative sample, why would a preponderance of 'typical' spam be needed to make the end scores reasonable? As long as typical spam is represented, won't the GA find a fit that matches it?

Further, and now I'm really stretching to make a point, I work at a physics lab, and essentially the entire reason for our computer systems is to filter out the typical events in order to keep the interesting ones for later study. The reason is precisely that the typical events are well understood and accounted for, while the events that break the system are the ones most critical to making new discoveries. Why isn't this true in this case as well? Why aren't the atypical cases highly prized commodities that must be incorporated in order to get better matches in the future? If the atypical events aren't added to the system every single time they occur (ideally), when will the GA ever have a chance to fit them?

Conversely, I can see why an extremely large set of non-spam is important in order to verify the fit against a diverse population. However, again, it isn't *really* the size of the set that matters, but the diversity of the population. Is your statement really one of "we have no way of assuring a diverse population except by incorporating large quantities"?

This is a really interesting system when you stop and think about it! Thanks again for putting it together! (You too, Craig.)

rw2

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
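
For anyone following the thread, here is a minimal sketch of the hit-rate concern raised in the quoted reply. It is not SpamAssassin's scoring code or its GA; the rule name HYPOTHETICAL_RULE, the 80% hit fraction, and the corpus sizes are all made-up assumptions chosen only to show the arithmetic. The idea: spam that slipped past the filter triggers few rules almost by definition, so stuffing the corpus with it drags down the measured hit-rate of a rule that actually performs well on ordinary mail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    is_spam: bool
    rules_hit: frozenset  # names of the rules this message triggered


def hit_rate(corpus, rule, spam=True):
    """Fraction of spam (or nonspam) messages in `corpus` that trigger `rule`."""
    pool = [m for m in corpus if m.is_spam == spam]
    if not pool:
        return 0.0
    return sum(rule in m.rules_hit for m in pool) / len(pool)


def build_corpus(typical_spam, missed_spam, nonspam):
    """Build a toy corpus; all proportions here are illustrative assumptions."""
    hit = frozenset({"HYPOTHETICAL_RULE"})
    none = frozenset()
    # Assumption: 8 out of every 10 typical spam messages trigger the rule.
    corpus = [Message(True, hit if i % 10 < 8 else none)
              for i in range(typical_spam)]
    # Missed spam (false negatives): assumed to trigger no rules at all.
    corpus += [Message(True, none) for _ in range(missed_spam)]
    # Nonspam: assumed never to trigger the rule.
    corpus += [Message(False, none) for _ in range(nonspam)]
    return corpus


# Mostly typical spam, with a small share of missed spam.
representative = build_corpus(typical_spam=1000, missed_spam=50, nonspam=1000)
# Same typical spam, but as many submitted false negatives as typical messages.
overbalanced = build_corpus(typical_spam=1000, missed_spam=1000, nonspam=1000)

rate_rep = hit_rate(representative, "HYPOTHETICAL_RULE")
rate_over = hit_rate(overbalanced, "HYPOTHETICAL_RULE")
print(f"representative corpus: {rate_rep:.3f}")
print(f"overbalanced corpus:   {rate_over:.3f}")
```

With these made-up numbers the rule's measured spam hit-rate falls from roughly 0.76 to 0.40, even though the rule's behaviour on typical spam has not changed. Any optimizer fitting scores to the second corpus would see a weaker rule than a real mail load would justify, which is one way to read "overbalance the corpus".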