[EMAIL PROTECTED] (Justin Mason) writes:

>> I still think my submissions of spam that isn't flagged by the system is of
>> particular usefulness, am I deluding myself?  It seems like FP's or FN's
>> would always be the most useful additions.
>
> Well, too many and they'll overbalance the corpus [1] so it no longer
> simulates a real mail load.  Lots of "generic" spam and nonspam is
> important, too, so rule hit-rates are measured accurately.

I said before that I understood the reasons why mismarked messages in the
respective corpuses (corpuscles?  :-)) were a bad thing, but I don't understand
this twist.

I played around with GAs a decade or so ago when they first gained mainstream
memeshare, but my knowledge is weak enough that your statement seems
counterintuitive.

As long as I have a representative sample, why would a preponderance of
'typical' spam be needed to make the end scores reasonable?  As long as
typical spam is represented, won't the GA find a fit that matches it?
Further, and now I'm really stretching to make a point, I work at a physics
lab and essentially the entire reason for our computer systems is to filter
out the typical events in order to keep the interesting ones for later study.
The reason for this is precisely that the typical events are well
understood and accounted for.  But the events that break the system are the
ones most critical to making new discoveries.  Why isn't this true in this
case as well?  Why aren't the atypical cases highly prized commodities that
must be incorporated in order to get better matches in the future?  If the
atypical events aren't added to the system every single time they occur
(ideally) when will the GA ever have a chance to fit them?

Conversely, I can see why an extremely large set of non-spam is important in
order to verify the fit against a diverse population.  However, again, it
isn't *really* the size of the set that matters, but the diversity of the
population.

Is your statement really that we have no way of assuring a diverse
population except by incorporating large quantities?
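
One way to picture the hit-rate concern quoted above is the toy calculation
below.  It is only a sketch under made-up assumptions: the rule, its hit
rates (80% on typical spam, 10% on atypical spam), and the 95/5 mail mix are
all hypothetical numbers, and nothing here reflects how the actual GA or
corpus tooling works.  It just shows how the measured hit rate of a rule
shifts when atypical messages are over-represented.

```python
def measured_hit_rate(n_typical, n_atypical, hit_typical, hit_atypical):
    """Fraction of all spam in a corpus that a single rule fires on."""
    hits = n_typical * hit_typical + n_atypical * hit_atypical
    return hits / (n_typical + n_atypical)

# Hypothetical rule: fires on 80% of typical spam, 10% of atypical spam.
HIT_TYPICAL, HIT_ATYPICAL = 0.80, 0.10

# Corpus mirroring an assumed real mail load: 95% typical, 5% atypical.
representative = measured_hit_rate(9500, 500, HIT_TYPICAL, HIT_ATYPICAL)

# Corpus overbalanced by many submitted false negatives: 50% / 50%.
overbalanced = measured_hit_rate(5000, 5000, HIT_TYPICAL, HIT_ATYPICAL)

print(f"representative corpus: {representative:.3f}")
print(f"overbalanced corpus:   {overbalanced:.3f}")
```

With these invented numbers the same rule looks far less effective on the
overbalanced corpus than it would be on real mail, so any score fitted to
that measurement would presumably be skewed as well.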

This is a really interesting system when you stop and think about it!

Thanks again for putting it together! (You too, Craig.)

rw2


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
