For some years now there has been a lot of effective spam filtering
using statistical approaches built on variations of Bayesian theory.
Some of these are inverse chi-square modifications to Naive Bayes;
CRM114 and other "languages" have even been developed to improve the
scoring of statistical spam analysis. For all of these statistical
processes, the spamicity is always between 0 and 1.
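As a rough illustration of the inverse chi-square idea mentioned above, here is a minimal Python sketch of Fisher's method for combining per-token spam probabilities (in the style popularized by Gary Robinson; the function names are my own, not any filter's actual API):

```python
import math

def chi2q(x2, df):
    """Survival function of the chi-square distribution for even df,
    computed with the standard series expansion."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_combine(probs):
    """Combine per-token spam probabilities with Fisher's method:
    under the null hypothesis, -2 * sum(ln p) follows a chi-square
    distribution with 2n degrees of freedom."""
    n = len(probs)
    # near 1 when the tokens look spammy
    s = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # near 1 when the tokens look hammy
    h = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Robinson's symmetric indicator, always in [0, 1]
    return (1.0 + s - h) / 2.0
```

This is why the statistical score always lands between 0 and 1: the combined indicator is a probability-like quantity, not an open-ended point total.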
Before this, and alongside it, has been the approach of
SpamAssassin, wherein every email is evaluated against a library of
rules and a number of points is assigned for each rule that hits.
Given enough points, the email is classified as spam; otherwise, ham.
To accommodate the Bayesian process, SA was modified with a Bayes
engine and the ability to add points depending on where the Bayesian
score fell (>.85, >.95, ...). For all of these processes, the score
falls between something negative and something positive, depending on
the total number of hits and the points assigned to them.
It occurred to me that this process of assigning points (added or
subtracted) to each "HIT" is slightly arbitrary. There is a long
process of evaluating the "most effective score" for each rule and
then providing that as the default. The mail admin has the option to
retune these various parameters as needed. To me, this looks like a
lot of knobs I can turn on a very complex machine I will probably
never really understand. In short, if I touch it, I will break it.
But the arbitrary part of the process is this manual balancing act
between how many points to apply to something and getting the call
from the CEO about his overabundance of East European teenage
solicitors (or lack thereof).
The thought I had, and have been working on for a while, is to change
how the scoring is done. Rather than making Bayes a part of the
scoring process, make the scoring process a part of the Bayesian
statistical engine. You would simply feed the scoring hits (binary
yes/no) into the Bayesian process as tokens, to be examined alongside
the other tokens in the message. It would then be the Bayes process
that determines the effective number of points assigned to each HIT,
based on what it has learned about that hit from you. So the tags
ADVANCE_FEE_1 and ADVANCE_FEE_2 would be represented as tokens of the
format:
ADVANCE_FEE_1=YES or NO
ADVANCE_FEE_2=YES or NO
and each of these tokens would then be evaluated through your
learning process.
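As a small sketch of the encoding step (the rule names here are just examples; nothing in this snippet is SpamAssassin's actual interface):

```python
# Hypothetical set of rules whose outcomes we want to tokenize.
RULES = ["ADVANCE_FEE_1", "ADVANCE_FEE_2", "RAZOR2_CHECK"]

def rule_tokens(hits):
    """Turn each rule's binary outcome into a NAME=YES/NO token
    that the Bayes engine can count like any ordinary word token."""
    return [f"{name}={'YES' if name in hits else 'NO'}" for name in RULES]

# A message that hit only ADVANCE_FEE_1 would contribute:
tokens = rule_tokens({"ADVANCE_FEE_1"})
```

Note that encoding misses as explicit NAME=NO tokens, rather than just omitting them, is what lets the engine learn from a rule's *absence* as well as its presence.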
An advantage of this would be eliminating the process of determining
the best number of points to assign, or of deciding whether you even
want a rule included. Point assignments would be determined from the
statistical hits (number of spam, number of ham) and would be tuned
on a per-site or per-user basis, depending on the Bayes engine
configuration. Each user, by means of their feedback, would tune the
importance of each rule applied.
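One way that per-token tuning could work is Robinson's smoothed probability estimate, which pulls rarely seen tokens toward a neutral prior. This is only a sketch under that assumption, not any existing engine's code:

```python
def spamicity(spam_hits, ham_hits, n_spam, n_ham, s=1.0, x=0.5):
    """Robinson-style smoothed spam probability for one token
    (e.g. "ADVANCE_FEE_1=YES"), given how often it appeared in the
    user's trained spam and ham corpora of sizes n_spam and n_ham."""
    # Normalize by corpus size so unbalanced training stays fair.
    p_spam = spam_hits / max(n_spam, 1)
    p_ham = ham_hits / max(n_ham, 1)
    n = spam_hits + ham_hits
    raw = p_spam / (p_spam + p_ham) if (p_spam + p_ham) > 0 else x
    # Blend the raw ratio with the neutral prior x, weighted by
    # strength s: unseen tokens score exactly x = 0.5.
    return (s * x + n * raw) / (s + n)
```

Under this scheme, a user who keeps training a rule's YES token as spam drives its spamicity toward 1, which is exactly the "effective points" tuning described above, done per user with no manual knob.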
Whether you want to include a rule would be determined for you
automatically, based on the resulting scoring. If a rule has an
overall historical performance of 0.499, it's pretty obvious that it
is incapable of "seeing" your kind of spam/ham. But if you throw
together a rule, run it for a week, and find it's scoring 0.001 or
0.999, then you have evidence of how effective the rule is and can
continue to use it. It is conceivable that you could start with all
known rules and later remove every rule sitting nominally at 0.500,
improving performance through an objective process. The same would
apply to any of the networked rules like botnet, DCC, or Razor,
because they just produce a tag line and a YES/NO indication.
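That pruning pass could be as simple as the following sketch (the rule names and the 0.1 margin are illustrative assumptions, not recommended values):

```python
def prune_rules(rule_scores, margin=0.1):
    """Keep only rules whose historical spamicity is informative,
    i.e. far enough from the neutral 0.5 in either direction."""
    return {name: p for name, p in rule_scores.items()
            if abs(p - 0.5) >= margin}

history = {"GOOD_RULE": 0.97, "USELESS_RULE": 0.499, "HAM_RULE": 0.02}
kept = prune_rules(history)  # USELESS_RULE is dropped
```

The point is that the cut is made from observed statistics rather than from anyone's judgment about what a rule "should" be worth.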
I've been working on something like this myself to great effect, but
it would be far more practical to leverage the knowledge and
capability that already exist in SpamAssassin. However, I'm not
familiar enough with SpamAssassin to know how to gain visibility into
all the rules run and all their results (hits are easy in
PerMsgStatus, but misses are not). If someone would be willing to
give me some pointers, or a roadmap of sorts, it would be
appreciated.
Many thanks to those of you who have read this far, for your patience
and consideration.