decoder wrote:
LuKreme wrote:
This is an excellent idea, but it also needs rule hits on ham, right?
You're right if you're saying that the method would work better if
there were more ham rules. From what I have seen in my experiments,
however, the results are already very precise with the current SA
ruleset. But any rule that adds some information to the feature set
might still increase the performance, especially on spam that SA
does not recognize. On ham and spam that SA also classifies correctly,
the algorithm performs nearly as well as SA itself.
What I'm thinking, once this gets working, is to write what I'll call
"informational rules". These rules would by themselves be 0 point rules
and might at best be only slight indicators of spam vs. ham, but when
combined with other rules would enhance the ability to form accurate
metarules. And perhaps tokens can come from other things than just
rules. Like the countries the message has passed through. Or individual
word rules that we stopped using a long time ago. Marketing phrases.
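
A rough sketch of the "informational rules" idea, in Python. Everything here is an assumption for illustration: the rule names (INFO_RED_TEXT, INFO_MARKETING) are made up, the toy corpus is invented, and a plain naive Bayes scorer over rule hits and rule pairs stands in for whatever the real method ends up being. It is not decoder's algorithm and not anything SA does today.

```python
import math
from collections import Counter
from itertools import combinations


def tokens(hits):
    """Return the rule names plus one token per co-occurring pair of rules."""
    names = sorted(hits)
    pairs = {a + "&" + b for a, b in combinations(names, 2)}
    return set(names) | pairs


def train(messages):
    """messages: iterable of (set of rule names hit, is_spam)."""
    spam, ham = Counter(), Counter()
    n_spam = n_ham = 0
    for hits, is_spam in messages:
        if is_spam:
            n_spam += 1
            spam.update(tokens(hits))
        else:
            n_ham += 1
            ham.update(tokens(hits))
    return spam, ham, n_spam, n_ham


def log_odds(hits, model):
    """Naive Bayes log-odds of spam; positive means 'looks like spam'."""
    spam, ham, n_spam, n_ham = model
    # Prior log-odds, with add-one smoothing so empty classes do not blow up.
    score = math.log((n_spam + 1) / (n_ham + 1))
    for tok in tokens(hits):
        p_spam = (spam[tok] + 1) / (n_spam + 2)
        p_ham = (ham[tok] + 1) / (n_ham + 2)
        score += math.log(p_spam / p_ham)
    return score


# Hypothetical corpus: each rule alone shows up in ham as often as in spam,
# but the two together show up only in spam.
corpus = (
    [({"INFO_RED_TEXT", "INFO_MARKETING"}, True)] * 3
    + [({"INFO_RED_TEXT"}, False)] * 3
    + [({"INFO_MARKETING"}, False)] * 3
)
model = train(corpus)
print(log_odds({"INFO_RED_TEXT"}, model))
print(log_odds({"INFO_RED_TEXT", "INFO_MARKETING"}, model))
```

In this toy corpus either rule on its own comes out slightly below zero (leaning ham), while the two together come out clearly above zero. That is the point of a 0 point rule: it carries no score by itself but still contributes a usable token once it is combined with others.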
I remember that when Bayes first came out, we discovered that RED text
was a stronger indicator of spam than words like viagra. I'm hopeful
that this is going to give us a breakthrough like that where we find
that interesting combinations change the way we see spam filtering.
I'm looking forward to seeing what comes of this.