> On Mon, 27 Aug 2007, OliverScott wrote:
> >1. Most users don't know how, aren't allowed, or can't be bothered to train
> >Bayes. In most cases spamassassin is left to auto-train bayes.

On 27.08.07 09:46, Chris St. Pierre wrote:
> Disagree.  With proper training -- or if you make it trivially easy,
> like GMail/Yahoo's "Report as Spam" links -- then users will train Bayes.

Yes, but according to what YOU mention in point 2, this may be
counterproductive...

> >2. Most people would consider the same emails to be SPAM. 90% of what I
> >think is spam would also be what you think is spam, with only a small
> >percentage of emails that we disagree on.

> Strongly disagree.  Many users consider anything they don't want to be
> spam, including all sorts of solicited email.

The fact that users don't distinguish between spam and mail they subscribed
to may speak against a personalized Bayes database: some users will taint
their own database and it will become less and less effective. Of course,
their reports should still go to their personal Bayes database, not the
shared one. If they have to train a Bayes database, they should train their
own.

However, users should be clearly told that "report as spam" can cause
problems of this kind.

> >4. Site wide bayes saves disk space and more importantly it saves
> >significantly on disk IO or memory requirements.
> 
> Not sure on this one.  None of the performance statistics I gather saw
> any noticeable hit when I switched from sitewide to per-user.

A shared database will take less disk space (and less memory when loaded)
and will probably stay in memory most of the time, so it won't need to be
loaded very often. However, I don't think this will improve efficiency
much...
 
> >5. A larger database leads to more accurate Bayesian identification - I am
> >guessing this is right?
> 
> "It depends." :)  With Bayes poisoning all the rage, it sometimes
> helps to avoid a really huge database.

Someone mentioned here that Bayes poisoning is a myth... I'm not sure how
much truth there is in that, but my Bayes filter has been working well for
some time...

> So what's important is having a well-tuned database -- not necessarily
> a large database.

A large well-tuned database is much better than a small fine-tuned one.
With many users it has to be larger, because many users receive many
different kinds of e-mail.

> If Joe and Jane User get different kinds of mail, disagree on what spam
> is, etc., then they should have different databases.  (What if Joe
> receives a legitimate newsletter on stock tips, for instance?)

How can Jane get a legitimate newsletter on stock tips when she didn't ask
for it? How can it be legitimate if she does not want it?
(provided she did what she could to stop receiving it)

> With a diverse user base, any sort of one-size-fits-all filtering is
> bound to increase FPs and FNs.

Yes, however the default scores for the BAYES rules are not that high, so a
shared database won't change the final score that much :)

Also, note that a single word will never change the BAYES score that much,
so I would not be too worried that the one word "viagra" would change much
in the final score.
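
To illustrate the point about a single word, here is a rough sketch in
Python. It assumes a plain naive Bayes combination of per-token spam
probabilities, which is NOT SpamAssassin's actual combiner, and the token
probabilities (0.2 for ordinary words, 0.99 for "viagra") are made-up
values chosen only for the example.

```python
import math

def combine(token_probs):
    """Combine per-token spam probabilities into one message probability.

    Plain naive Bayes combination computed in log space. This is an
    illustrative assumption, not the formula SpamAssassin itself uses.
    """
    log_spam = sum(math.log(p) for p in token_probs)
    log_ham = sum(math.log(1.0 - p) for p in token_probs)
    # Equivalent to spam / (spam + ham), written to avoid underflow.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

# Hypothetical message: ten ordinary tokens that lean towards ham.
ham_tokens = [0.2] * 10

without_word = combine(ham_tokens)
# Same message plus one strongly spammy token such as "viagra".
with_word = combine(ham_tokens + [0.99])

print(f"without the word: {without_word:.6f}")
print(f"with the word:    {with_word:.6f}")
```

Under these assumptions the extra token raises the combined probability,
but the message still ends up far on the ham side, because the other ten
tokens outweigh it.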

-- 
Matus UHLAR - fantomas, [EMAIL PROTECTED] ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Due to unexpected conditions Windows 2000 will be released
in first quarter of year 1901
