On 18/01/2011 10:46, Jeff Chan wrote:
> 2. Some of the areas are very difficult to resolve into spam or
> ham. Some more aggressive anti-spammers may say all of the above
> is spam, but others may disagree, and the mail may be legal.
I'd suggest that SA ought to be classifying e-mail in *three* broad
categories, not two.
Firstly, definite spam, unsolicited in any way.
Secondly, definite ham (i.e. primarily genuine person-to-person e-mail
and actively solicited messages such as confirmations of website
transactions), which even the most aggressive spam-fighters would agree
is ham. FPs in this category are bad news.
And thirdly, an in-between category, of which opt-in advertising is a
prime example, which at least some users are happy to receive, but where
FPs aren't a major problem.
With a few relatively rare exceptions, SA already classifies these
categories pretty effectively, especially with a well-trained Bayesian
db. Genuine ham tends to come in with negative scores, occasionally
straying up to about 1 or 2. Likewise, undisputed spam rarely scores
less than 8 or 10. And opt-in advertising typically comes in with
"neutral" scores of 0 to 4. So far, so good.
Using this opt-in advertising, which IMO ought to be getting neutral
scores, as a ham corpus is inevitably going to be problematic. Using it
as a third, neutral corpus that is given far less weight than genuine
ham would be a different matter, but would require a major change in
the scoring algorithms.
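To make the "far less weight" idea concrete, here is a rough sketch of how false positives might be costed if a neutral corpus sat alongside the ham corpus. It is not how SA's score generation actually works. The weights, the corpus labels and the function name are all made up for illustration, and the 5.0 threshold is simply SA's default required_score.

```python
from typing import Iterable, Tuple

# SpamAssassin's default required_score; messages at or above it are tagged.
SPAM_THRESHOLD = 5.0

# Hypothetical weights (assumptions, not SA values): a false positive on
# genuine ham costs ten times as much as one on opt-in advertising.
FP_WEIGHT = {"ham": 1.0, "neutral": 0.1}


def weighted_fp_cost(messages: Iterable[Tuple[str, float]]) -> float:
    """Sum the weighted cost of false positives.

    Each message is a (corpus, score) pair, where corpus is "ham",
    "neutral" or "spam". Messages from the spam corpus cannot be false
    positives, so they contribute nothing here.
    """
    cost = 0.0
    for corpus, score in messages:
        if corpus in FP_WEIGHT and score >= SPAM_THRESHOLD:
            cost += FP_WEIGHT[corpus]
    return cost


sample = [
    ("ham", 6.0),      # genuine ham tagged as spam: full cost
    ("neutral", 6.0),  # opt-in advertising tagged as spam: small cost
    ("neutral", 2.0),  # opt-in advertising left alone: no cost
    ("ham", -1.0),     # genuine ham left alone: no cost
    ("spam", 12.0),    # spam tagged correctly: not a false positive
]
total = weighted_fp_cost(sample)
```

With weights like these, a rule set that occasionally tags opt-in advertising would be penalised only lightly, while one that tags genuine ham would still be penalised heavily.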
John.
--
-- Over 4000 webcams from ski resorts around the world - www.snoweye.com
-- Translate your technical documents and web pages - www.tradoc.fr