On Wed, Feb 06, 2008 at 05:02:46PM +0100, Paolo Cravero wrote:
> Arthur Dent wrote:
>
>> Learned tokens from 8 message(s) (3165 message(s) examined)
>> Learned tokens from 4628 message(s) (8703 message(s) examined)
>> Learned tokens from 3890 message(s) (8634 message(s) examined)
>> Learned tokens from 2264 message(s) (8671 message(s) examined)
>> Learned tokens from 2303 message(s) (8620 message(s) examined)
>
> "Odds 2,000,127 against one... and counting..."
*
>
>> Notice that although the amount of tokens being learned seems to be coming
>> down gradually, the total far exceeds the total amount of ham mails in the
>> corpus.
>
> The number of *messages* learned is decreasing, not the number of tokens.
Yes, sorry, that was imprecise of me; I meant the number of messages, of
course. But the point still stands: each run learned tokens from a decreasing
number of messages, yet since these are largely the same messages as the
previous day, it cannot really have learned from 13,085 messages when there
are only around 8,650 in the corpus (see below for an explanation).
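(For the record, summing the four big runs from the output quoted above shows
where that 13,085 comes from, e.g. in Python:

    runs = [4628, 3890, 2264, 2303]   # "Learned tokens from N message(s)" counts
    print(sum(runs))                  # 13085, versus only ~8,650 in the corpus

so the later runs must be counting messages that were already learned.)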

> Could it be that something deletes the temp folder before sa-learn has 
> finished, so it gets distracted and starts flying away carrying a suitcase?

Hmmm... Not deleted exactly, but the sa-learn job takes so long that the
archivemail job kicks off, finds the "TempSpam" and "TempHam" mboxes in the
Mail directory, and dutifully chops out anything older than 180 days. I didn't
think that would be a problem, but maybe it's upsetting sa-learn? I will try
switching the order of the jobs (archivemail running first) and see if that
makes a difference.
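Something along these lines is what I have in mind: a single wrapper script
that runs the jobs strictly in sequence, so archivemail can never fire while
sa-learn is still reading the mboxes. (Untested sketch; the paths are just my
guesses at a plausible setup.)

    #!/usr/bin/env python
    # Run archivemail first, then sa-learn, so the jobs can never overlap.
    import subprocess
    import sys

    MAILDIR = "/home/mark/Mail"   # hypothetical path; adjust to taste

    jobs = [
        # prune the 180-day window *before* training starts
        ["archivemail", "--days=180", MAILDIR + "/TempHam", MAILDIR + "/TempSpam"],
        # train only once archiving has finished
        ["sa-learn", "--ham", "--mbox", MAILDIR + "/TempHam"],
        ["sa-learn", "--spam", "--mbox", MAILDIR + "/TempSpam"],
    ]

    for cmd in jobs:
        rc = subprocess.call(cmd)
        if rc != 0:
            sys.exit("job failed: %s (exit %d)" % (" ".join(cmd), rc))

Cron would then run just this one script, instead of two jobs whose timing
can drift into each other.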

> Or do you receive >8600 messages each day? Some of them might have been 
> autolearned on the incoming SMTP channel, BTW.

Well, as I explained in my previous post, the "TempHam" folder is a
concatenation of all my non-spam folders. Mail that is older than 180 days is
taken off at one end and new mail (c. 30-40 per day) added on at the other.
The total remains roughly constant.
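If anyone is curious, the concatenation can be done with something like this
untested sketch using Python's standard mailbox module; the folder names here
are made-up examples, not my real layout:

    import mailbox

    HAM_FOLDERS = ["inbox", "lists", "newsletters", "work"]   # hypothetical names

    # Append every message from each ham folder onto one TempHam mbox.
    temp = mailbox.mbox("/home/mark/Mail/TempHam")
    temp.lock()
    try:
        for name in HAM_FOLDERS:
            for msg in mailbox.mbox("/home/mark/Mail/" + name):
                temp.add(msg)
    finally:
        temp.unlock()
        temp.close()   # close() flushes the appended messages to disk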

> IMHO it is not necessary to train the Bayes DB so extensively. If you want 
> the process to complete in a decent amount of time, feed it fewer messages 
> at a time.

Agreed, but I want to give it a good mix of ham that includes regular mail,
mailing lists (such as this one), newsletters, work stuff, etc. It just seemed
easier to lump everything together and feed it all to sa-learn...
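Though if I do go the smaller-batches route, a splitter along these lines
would do it (again an untested sketch; the batch size of 1,000 is an
arbitrary pick):

    import mailbox

    CHUNK = 1000   # arbitrary batch size, small enough for a quick sa-learn run
    src = mailbox.mbox("/home/mark/Mail/TempHam")

    out = None
    for i, msg in enumerate(src):
        if i % CHUNK == 0:        # start a new chunk file every CHUNK messages
            if out is not None:
                out.close()
            out = mailbox.mbox("/home/mark/Mail/TempHam.%d" % (i // CHUNK))
        out.add(msg)
    if out is not None:
        out.close()

    # then, per chunk:  sa-learn --ham --mbox /home/mark/Mail/TempHam.0  etc.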

>
> Paolo

Thanks

Mark


> PS: whoever knows who "Arthur Dent" is/was will understand the oddities in 
> this reply. All others: get a copy of the HHGTTG. :-)
* - But the problem is that my infinite improbability drive is broken. If only
I could just have a nice cup of tea...

