> From: Robert Menschel
> Sent: Friday, January 09, 2004 7:34 PM
[...]
>
> Evil rules, if/when guaranteed, can be scored at or above your spam
> threshold. An example from my personal files (where my spam threshold is
> 9):
> uri       RM_u_530000x           /530000x\.net/i
> describe  RM_u_530000x           body contains link to known spammer
> score     RM_u_530000x           9.000  # 582s/0h of 81383 corpus
>
> Additive rules should be analyzed for how many hits should flag spam. If
> popcorn spam hits 7 popcorn rules, and no ham hits 6 popcorn rules, and
> your spam threshold is 5,  then a good score for the popcorn family is
> 5/7. Or if you want to be conservative, 4.5/7, requiring something else
> to hit to complete the spam flag.
>
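
The popcorn arithmetic quoted above can be checked with a few lines of
Python. This is only a sketch: family_score and its margin parameter are
names made up for illustration, not anything in SA, and the threshold of 5
and the hit counts (7 for spam, at most 5 for ham) are taken from the
example as given.

```python
def family_score(threshold, spam_hits, margin=0.0):
    """Per-rule score so that spam_hits rules together total threshold - margin.

    Hypothetical helper for illustration; not part of SpamAssassin.
    """
    return (threshold - margin) / spam_hits


THRESHOLD = 5.0      # spam threshold from the quoted example
SPAM_HITS = 7        # popcorn spam hits 7 popcorn rules
MAX_HAM_HITS = 5     # "no ham hits 6 popcorn rules"

aggressive = family_score(THRESHOLD, SPAM_HITS)          # 5/7 per rule
conservative = family_score(THRESHOLD, SPAM_HITS, 0.5)   # 4.5/7 per rule

print(f"aggressive:   {aggressive:.3f} per rule")
print(f"conservative: {conservative:.3f} per rule")
print(f"worst-case ham total: {MAX_HAM_HITS * aggressive:.3f}")
```

With the conservative score, seven popcorn hits total 4.5, so some other
rule still has to fire before the message crosses the threshold.
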
> All other rules should be subject to GA.
>
> The above is my stance when being aggressively anti-spam.
>
> However, the other half of the time, I'm conservatively anti-spam, and I
> recognize that putting ALL the rules through a GA doesn't hurt the
> effort. It may weaken some rules, but it strengthens the overall effort.
>
> We can't all run a GA. I haven't even figured out how to do it yet. I've
> gotten quite good at running mass-checks on rules and rule sets, and run
> some mass-check on something almost daily. I also have a variety of
> algorithms by which I determine what scores to use for which rules. But
> those algorithms are based on flat corpus statistics, and not on any
> evolutionary exploration of the scores themselves.
>
> Lacking a simple way to run a GA, I find intelligent if flat one-pass
> algorithms to be very useful. And the simplest of those does apply to
> BigEvil -- if it's BigEvil, it's spam.

Hmm, perhaps not always. If I send a note to you regarding a URL that
I ran into that isn't in my rev. of the Evil list but is in yours, then
we could argue that my note isn't spam, even though it references an
Evil URL. That is where whitelisting, and exempting lists that discuss
spam from filtering, still has a place.

Here's an idea that I've been considering for a while: have SA change its
scoring strategy to use a Neural Net instead of strictly additive scoring.
SA would still use its custom rules to detect spam markers, but it would
let the NN do the scoring. Neural nets are generally good at discriminating
between binary categories: after training, the NN can give a single
attribute enough weight to drive the spam/ham decision on its own, or it
can combine the weights of various attributes. A disadvantage, or perhaps
an advantage, is that it is difficult to understand the effect that a
single factor has on the decision process. Another disadvantage is that it
may be difficult to combine the results provided by the community at large
when running a new release's rule set. I think there is a way to handle
this: each user would provide a file in which each line records the rules
that were hit by one message, plus a designation as to whether that
message was ham or spam. All of these files would be combined to train the
final neural net.
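
To make the idea concrete, here is a rough sketch in Python of training on
such a file. Everything in it is an assumption for illustration: the line
layout (label first, then the rule names), the sample lines, and all rule
names other than RM_u_530000x. It is also the simplest possible "net", a
single logistic unit with one weight per rule, not a multi-layer network.

```python
import math
import random


def parse_line(line):
    """'spam RULE_A RULE_B' -> (1, {'RULE_A', 'RULE_B'}); assumed layout."""
    label, *rules = line.split()
    return (1 if label == "spam" else 0), set(rules)


def predict(weights, bias, rules):
    """Probability of spam given the set of rules that hit."""
    z = bias + sum(weights.get(r, 0.0) for r in rules)
    return 1.0 / (1.0 + math.exp(-z))


def train(lines, epochs=200, rate=0.5, seed=0):
    """Fit one weight per rule with stochastic gradient descent."""
    data = [parse_line(l) for l in lines if l.strip()]
    weights = {r: 0.0 for _, rules in data for r in rules}
    bias = 0.0
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(data)
        for label, rules in data:
            err = label - predict(weights, bias, rules)
            bias += rate * err
            for r in rules:
                weights[r] += rate * err
    return weights, bias


# Made-up hit logs, as if merged from several users' files.
hit_log = [
    "spam RM_u_530000x",
    "spam RM_u_530000x HTML_MESSAGE",
    "spam POPCORN_1 POPCORN_2 POPCORN_3",
    "spam POPCORN_1 POPCORN_2 POPCORN_3 HTML_MESSAGE",
    "ham HTML_MESSAGE",
    "ham POPCORN_1",
    "ham",
]

weights, bias = train(hit_log)
for rule, w in sorted(weights.items()):
    print(f"{rule:14s} {w:+.2f}")
```

Because this single unit is still additive, its learned weights can be read
much like ordinary scores. A net with a hidden layer could pick up
interactions between rules, at the cost of the interpretability problem
mentioned above.
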

I found a few papers describing experiments where a Neural Net was used
to make spam/ham decisions using very simple criteria. The results were
good but not impressive; however, I felt that the tests were
oversimplified. I think it would be "interesting" to train a Neural Net
using the various features that SA detects. I doubt that such an approach
will gain favor in a production release of SA, but I can see where it
might be useful in a localized context.

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk