Erik B. Berry wrote:

> > I was aware of the stuff you're pointing out below.  This is basically caused by
> > using the new evolver to do the scoring.  Previously, scores were limited to the
> > range 0.01-5; now they are unlimited and allowed to go negative.  A side
> > effect of
> > this is that rules which are really non-discriminators end up sometimes getting
> > odd-looking scores.  For example, CYBER_FIRE_POWER is just not likely to really
> > be worth -4.020 if looked at in isolation, but it turns out that the 10 messages
> > in the corpus which trigger that rule also trigger about a billion other ones.
> 
>   I don't know how you are assigning the default rule scores for the initial
> run of the evolver, but how about calculating the ratio of spam/nonspam for
> each "commonly" encountered rule, scaling those ratios somehow from 0 to -5
> (for ratios < 1) and from 0 to 5 (for ratios > 1), assigning those scores as
> the default scores to the GA evolver, and then letting it run for a bit to
> see if it removes some of the very odd scores.  I only suggest this in case
> the evolver is using random scores as the

Actually, the default scores are all scaled to 0..5, not -5 to 5, since it's 
still using the code built for Justin's original evolver.  I agree initializing 
with random scores is not necessarily a great idea in this case.
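For what it's worth, Erik's ratio-based initialization could be sketched roughly like the following.  This is just one plausible reading of "scaling somehow": it uses a log of the smoothed spam/nonspam hit ratio, clamped to the -5..5 range, and the rule names and hit counts are made up for illustration.

```python
import math

def initial_scores(hits, max_score=5.0):
    """Scale each rule's spam/nonspam hit ratio into [-max_score, max_score].

    `hits` maps rule name -> (spam_hits, nonspam_hits).  The +1 smoothing
    keeps rules that never hit one corpus from blowing up the ratio.
    """
    scores = {}
    for rule, (spam, nonspam) in hits.items():
        ratio = (spam + 1) / (nonspam + 1)
        # log maps ratio < 1 to negative scores, ratio > 1 to positive ones
        score = math.log(ratio)
        # clamp into the range the evolver expects as a starting point
        scores[rule] = max(-max_score, min(max_score, score))
    return scores

# made-up hit counts, just to show the shape of the output
hits = {
    "CYBER_FIRE_POWER": (10, 40),   # hits more nonspam than spam -> negative
    "CLICK_BELOW":      (500, 2),   # strong spam indicator -> clamped to +5
}
print(initial_scores(hits))
```

Seeding the GA population with something like this instead of random vectors would at least start the search near a sane region of score space.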

> default values, which I'm sure some AI libraries do by default.  I'm not
> convinced a GA could be any better than a neural net for this purpose, but
> I'm not an expert in AI.

I'm not a particularly religious person about this, but neural nets suck ass
for most things, particularly problems where you have lots of degrees of
freedom and not a lot of time to come up with a solution.  Plus, I've never
seen a "real" application solved by a neural net that wasn't in some way
actually not really a neural net.  Certainly not commercially; I'm willing to
concede, though, that there might be a professor somewhere who eats, drinks,
and sleeps neural nets who has come up with something.

>   The problem with relying mostly on the GA scores compared to "common
> sense" scores (possibly with some GA influence) is that the spam caught by
> the default scores then becomes more and more dependent on the corpus in
> use.  The non-spam corpus might contain more techie speak than most users
> receive, for example.  Granted, SA is configurable for just this reason, but
> I doubt most end-users (not admins) tweak the configuration.  The spammy
> corpus is likely fairly accurate if they are all recent (spam changes fast).
> My other hunch is that relying on the GA also makes the resulting scores
> more dependent on the assumption that users have the default value of 5 as
> the score cutoff.

You're assuming the algorithm is designed in a way that will overfit to the 
corpus.  Actually, the fitness function is reasonably "loose" in both Justin's 
and my evolvers.  The scores will probably be better for techie mail than for 
non-techie mail, but I think they'll be pretty good for both.  I do strongly 
wish we had more non-techie nonspam in the corpus, but there is more than none 
in there now.  Just in my own mail archive, I have 3 years of emails sent and 
received during a time when I was starting, running, then selling a company: 
hiring, firing, communicating with lawyers, swapping emails with investors 
(try differentiating a solicitation from a VC from a piece of spam) and 
customers, marketing agencies, employees, etc, etc, etc.  I'm intentionally 
trying not to swamp that out too badly with email from more purely 
tech-focused people.
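To make "loose" concrete: a fitness function in this style doesn't try to reproduce each message's label exactly, it just sums weighted misclassification costs at the threshold, so many different score vectors fit "well enough".  This is a toy illustration of that idea, not the actual evolver code; the weights, rule names, and corpus are invented.

```python
def fitness(scores, corpus, threshold=5.0, fn_weight=1.0, fp_weight=10.0):
    """Toy fitness for a candidate score vector (lower is better).

    `corpus` is a list of (hit_rules, is_spam) pairs.  False positives
    (nonspam tagged as spam) are weighted much more heavily than false
    negatives, since tagging real mail is the worse failure mode.
    """
    cost = 0.0
    for hit_rules, is_spam in corpus:
        total = sum(scores.get(r, 0.0) for r in hit_rules)
        if is_spam and total < threshold:
            cost += fn_weight          # missed spam (false negative)
        elif not is_spam and total >= threshold:
            cost += fp_weight          # tagged nonspam (false positive)
    return cost

# invented two-message corpus
corpus = [
    (["CLICK_BELOW", "ALL_CAPS"], True),
    (["CYBER_FIRE_POWER"], False),
]
scores = {"CLICK_BELOW": 4.0, "ALL_CAPS": 2.0, "CYBER_FIRE_POWER": -1.0}
print(fitness(scores, corpus))  # 0.0 -- both messages land on the right side
```

Because only the total relative to the cutoff matters, an individual rule's score can drift to an odd-looking value (like CYBER_FIRE_POWER's -4.020) without hurting fitness, which is exactly the effect described above.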

C


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
