Alex wrote on Sun, 28 Mar 2010 13:38:25 -0400:

> I have a bayes db that's about 160MB with a 40MB token db on a system
> with about 100k messages per day.

Well, what's the missing 120 MB? The journal? Do a complete sync and then 
delete it.

> I've just raised the max_db_size set
> to 1.1M tokens (there are currently 1.06M tokens in there).

That's not much for a system with 100,000 messages a day. I don't mean 
it's not sufficient, it is just not "too much". You should be aware that 
expiry does not hold the db at 100% of max_db_size: when it runs, it 
trims the db down to about 75% of that value.
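
To put numbers on that, a rough sketch, assuming the option you raised is 
bayes_expiry_max_db_size in local.cf (the value is taken from your mail):

```
# local.cf (sketch; value from the quoted mail)
bayes_expiry_max_db_size  1100000
# 75% of 1,100,000 is 825,000 tokens, so that is the figure to plan
# around, not the full 1.1M. You currently hold about 1.06M tokens.
```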

> I've also
> changed bayes to write to the journal instead of directly to the
> database and just checking it periodically to see if the journal needs
> to be synced.

I suggest you change to SQL. This eliminates the journal.
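
A minimal sketch of what that looks like in local.cf, assuming MySQL. The 
database name, host, user and password below are placeholders, not values 
from your setup:

```
# local.cf (sketch; DSN and credentials are hypothetical)
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn       DBI:mysql:bayes:localhost
bayes_sql_username  sa_user
bayes_sql_password  changeme
```

The existing data can usually be carried over with sa-learn --backup 
before the switch and sa-learn --restore after it.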

> 
> Can someone explain to me the relationship between the frequency of
> "1-occurrence tokens" and the size of the database? Here is the output
> from a recent manual sync:
> 
> token frequency: 1-occurrence tokens: 72.60%
> token frequency: less than 8 occurrences: 18.11%
> 
> I was thinking that because the tokens are seen only once,

it probably means you get a lot of fresh tokens in. Do you autolearn?

> the
> database was too big, so I lowered it back down, but I think that was
> a mistake.

"Too big" is not an absolute figure. If you store 1-occurrence tokens you 
will obviously have more tokens than without them. If you slash the db 
(which removes tokens of every kind, not just the 1-occurrence ones) and 
performance goes down afterwards, that was obviously the wrong decision 
;-) I don't know whether or how this shows up in the size of the database 
file itself. It is a DBM database, which by design has certain sizes no 
matter how many tokens are in it. A token database of only 40 MB is not 
overly large; it's normal.

> Now some of the same emails are continually hitting only
> BAYES_50 while others seemingly the same hit BAYES_99. I've now raised
> the number of tokens available and continue to manually train the
> database with spam and ham (there are about 1.1M spam and 500k ham
> currently).

You should use autolearn if you don't already. If you want to be careful, 
you can set the learning thresholds to more conservative values. (I think 
I use 8 for spam and keep the default for ham.)
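
For illustration, the corresponding local.cf lines would look roughly like 
this. The 8 is from memory, as I said; as far as I know the shipped 
defaults are 12.0 for spam and 0.1 for ham:

```
# local.cf (sketch)
bayes_auto_learn                    1
# spam threshold as I remember using it (shipped default: 12.0)
bayes_auto_learn_threshold_spam     8.0
# ham threshold left at the shipped default
bayes_auto_learn_threshold_nonspam  0.1
```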

> Have I configured something wrong, or am I misunderstanding how this
> works? Is there something else I should read?

I think your db was OK as it was. You should read up on how to switch to 
SQL ;-) Run the expiry once a night from cron.
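
A sketch of that, assuming you turn off the automatic expiry so that only 
the cron job does it. The time of day is arbitrary:

```
# local.cf (sketch): no opportunistic expiry during scanning
bayes_auto_expire  0
# crontab entry for the user that owns the bayes db, shown as a comment:
# 30 3 * * *  sa-learn --force-expire
```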

Kai

-- 
Get your web at Conactive Internet Services: http://www.conactive.com


