At 02:45 PM 12/13/2003, Bryan Hoover wrote:
> 742753 - total number of words in it
> 515654 - total number of words which have been seen only once
> 80485 - ... twice
> 35325 - ... 3 times
>
> These statistics show that most of the db is not used and is just eating my hard drive (44 MB total size). Is this a normal situation?
>

My dump magic output is similar -- I didn't run through it all, but
there are a lot of tokens with few occurrences.


Don't confuse "seen only once" with "not used". "seen only once" really means "trained only once".

If the large number of "seen only once" tokens bothers you, you can disable hapaxes.

From the Mail::SpamAssassin::Conf manpage:

       bayes_use_hapaxes        (default: 1)
           Should the Bayesian classifier use hapaxes
           (words/tokens that occur only once) when classifying?
           This produces significantly better hit-rates, but
           increases database size by about a factor of 8 to 10.

In your case, it looks like hapaxes are about 7/10ths of your database (515654 of 742753 tokens), which is less than expected.
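
As a rough check of that proportion, here is a small sketch using the counts from the quoted dump output. It assumes the manpage's "factor of 8 to 10" can be read as total size divided by size without hapaxes, and it compares token counts rather than bytes, so treat the result as an approximation only.

```python
# Counts copied from the quoted dump output earlier in this message.
total_tokens = 742753
seen_once = 515654  # hapaxes: tokens trained only once

hapax_fraction = seen_once / total_tokens
print(f"hapaxes: {hapax_fraction:.1%} of all tokens")

# Assumption: "increases database size by a factor of 8 to 10" means
# total / (total without hapaxes), so hapaxes would make up 1 - 1/factor.
expected = {factor: 1 - 1 / factor for factor in (8, 10)}
for factor, share in expected.items():
    print(f"factor {factor}: hapaxes would be about {share:.1%}")
```

Under that reading, the observed share of hapaxes falls below the range the manpage figure would imply.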


44 mb seems large for 743k.  The docs say there should be about 5mb/100k
tokens.  You might look at your configuration expiry variables and such
if you want a smaller db.

Well, 5mb * (743/100) = 37.15mb, which is pretty close to 44mb for an estimate. That doesn't seem large at all given the specs.
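
The same estimate can be redone with the exact token count instead of the rounded 743k. This is only a sketch of the arithmetic; the 5 MB per 100k tokens figure is the rule of thumb quoted above, not something measured here.

```python
tokens = 742753        # total tokens from the quoted dump output
mb_per_100k = 5.0      # rule of thumb quoted from the docs above
actual_mb = 44.0       # database size reported by the original poster

estimated_mb = mb_per_100k * tokens / 100_000
ratio = actual_mb / estimated_mb

print(f"estimated: {estimated_mb:.2f} MB")
print(f"actual:    {actual_mb:.0f} MB ({ratio:.2f}x the estimate)")
```

With the unrounded count the estimate comes out marginally lower than 37.15, and the actual database is a bit under 1.2 times that size.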


It does, however, look like Alexander should consider running sa-learn --force-expire to force an expiry run on the database.



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
