> From: Robert Menschel
> Sent: Friday, January 09, 2004 7:34 PM [...]
>
> Evil rules, if/when guaranteed, can be scored at or above your spam
> threshold. An example from my personal files (where my spam threshold is
> 9):
>   uri      RM_u_530000x  /530000x\.net/i
>   describe RM_u_530000x  body contains link to known spammer
>   score    RM_u_530000x  9.000  # 582s/0h of 81383 corpus
>
> Additive rules should be analyzed for how many hits should flag spam. If
> popcorn spam hits 7 popcorn rules, and no ham hits 6 popcorn rules, and
> your spam threshold is 5, then a good score for the popcorn family is
> 5/7. Or if you want to be conservative, 4.5/7, requiring something else
> to hit to complete the spam flag.
>
> All other rules should be subject to GA.
>
> The above is my stance when being aggressively anti-spam.
>
> However, the other half of the time, I'm conservatively anti-spam, and I
> recognize that putting ALL the rules through a GA doesn't hurt the
> effort. It may weaken some rules, but it strengthens the overall effort.
>
> We can't all run a GA. I haven't even figured out how to do it yet. I've
> gotten quite good at running mass-checks on rules and rule sets, and run
> some mass-check on something almost daily. I also have a variety of
> algorithms by which I determine what scores to use for which rules. But
> those algorithms are based on flat corpus statistics, and not on any
> evolutionary exploration of the scores themselves.
>
> Lacking a simple way to run a GA, I find intelligent if flat one-pass
> algorithms to be very useful. And the simplest of those does apply to
> BigEvil -- if it's BigEvil, it's spam.
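(Before replying: for concreteness, the flat one-pass arithmetic Robert describes works out roughly as in the Python sketch below. This is only my reading of it, not actual SpamAssassin tooling; the counts and thresholds are the ones from his examples.)

# A sketch only, not actual SpamAssassin code; the counts and thresholds
# come from the examples in the quoted message above.

def evil_score(spam_hits, ham_hits, threshold):
    """A 'guaranteed evil' rule (hits only spam in the corpus) can carry
    the whole spam threshold by itself."""
    return threshold if spam_hits > 0 and ham_hits == 0 else None

def family_score(threshold, min_spam_hits, margin=0.0):
    """Score each rule in an additive family so that the smallest number of
    family hits seen on spam reaches (threshold - margin); ham is assumed
    never to hit that many of the family's rules at once."""
    return (threshold - margin) / min_spam_hits

print(evil_score(582, 0, threshold=9.0))            # 9.0   (582s/0h of 81383)
print(round(family_score(5.0, 7), 3))               # 0.714 (5/7: 7 hits reach 5)
print(round(family_score(5.0, 7, margin=0.5), 3))   # 0.643 (conservative 4.5/7)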
"If it's BigEvil, it's spam" -- hmm, perhaps not always. If I send you a note about a URL I ran into that isn't in my revision of the Evil list but is in yours, we could argue that my note isn't spam, even though it references an Evil URL. That said, there is still a place for whitelisting, for not filtering lists that discuss spam, and so on.

Here's an idea I've been considering for a while: have SA change its scoring strategy to use a neural net instead of the strictly additive scoring. SA would still use its custom rules to detect spam markers, but it would let the NN do the scoring.

The advantage of neural nets is that they are generally good at discriminating between binary categories -- after training, the NN can give a single attribute a weighting strong enough to drive the spam/ham decision on its own, or it can combine the weights of various attributes. A disadvantage, or perhaps an advantage, is that it is difficult to understand the effect any single factor has on the decision process.

Another disadvantage is that it may be difficult to combine the results provided by the community at large when running a new release's rule set. I think there is a way to handle this: each user would provide a file in which every line records the rules that were hit for one message, plus a designation of whether that message was ham or spam. All of these records would be combined to train the final neural net (a rough sketch of that training step follows at the end of this note).

I found a few papers describing experiments where a neural net was used to make spam/ham decisions from very simple criteria. The results were good but not impressive; however, I felt the tests were oversimplified. I think it would be "interesting" to train a neural net on the various features that SA detects. I doubt such an approach will gain favor in a production release of SA, but I can see where it might be useful in a localized context.
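To make that training step concrete, here is a minimal sketch in Python. None of this is SpamAssassin code; the file name (rule-hits.txt) and the one-message-per-line format are assumptions standing in for whatever a mass-check-style report would actually produce, and the "neural net" is the simplest possible one -- a single sigmoid unit with one weight per rule.

# A minimal sketch, not SpamAssassin code.  Input format is assumed: one
# message per line, a ham/spam label followed by the rules that hit, e.g.
#
#   spam RM_u_530000x MIME_HTML_ONLY
#   ham  HTML_MESSAGE
#
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))          # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def load_corpus(path):
    """Return a list of (label, set-of-rule-names) plus the full rule list."""
    messages, rules = [], set()
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if not parts:
                continue
            label, hits = parts[0], set(parts[1:])
            messages.append((1.0 if label == "spam" else 0.0, hits))
            rules.update(hits)
    return messages, sorted(rules)

def train(messages, rules, epochs=20, rate=0.1):
    """Plain stochastic gradient descent: one weight per rule plus a bias."""
    weights = {r: 0.0 for r in rules}
    bias = 0.0
    for _ in range(epochs):
        random.shuffle(messages)
        for label, hits in messages:
            pred = sigmoid(bias + sum(weights[r] for r in hits))
            err = label - pred
            bias += rate * err
            for r in hits:
                weights[r] += rate * err
    return weights, bias

def is_spam(weights, bias, hits, cutoff=0.5):
    """Score a new message by the set of rules it hit."""
    return sigmoid(bias + sum(weights.get(r, 0.0) for r in hits)) >= cutoff

if __name__ == "__main__":
    msgs, rules = load_corpus("rule-hits.txt")      # hypothetical file name
    weights, bias = train(msgs, rules)
    print(is_spam(weights, bias, {"RM_u_530000x", "MIME_HTML_ONLY"}))

A real experiment would want at least a hidden layer and a held-out validation set, but even this degenerate case shows the point: the per-rule weights fall out of training on the community's combined records rather than out of hand-assigned or GA-derived score files.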