Faisal N Jawdat wrote:
> On Apr 21, 2007, at 11:23 AM, Matt Kettler wrote:
>> Ok, but how does knowing what SA learned it as help? It doesn't.
>>
>> Figure out what to train as, and train.
>
> it helps in that i can automatically iterate over some or all of my
> mail folders on a regular basis, selectively retraining *if*:
>
> a) the message has already been trained
> b) it's been trained the same way that i want it trained in the end
> and
> c) the cost of determining it's already been trained is substantially
> lower than the cost of just training it
But what's the point in that? Why not:

1) Tell sa-learn the way you want it trained, and
2) Let sa-learn check its "bayes_seen" database to see which messages were
already learned properly; it will automatically skip them, saving the
processing time of relearning them.


sa-learn will already skip messages that are correctly learned.
Completely automatically. In fact, there's no way to cause it to not
skip a message that's already been properly learned.

What does your proposal do that sa-learn already doesn't?
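For example (a rough sketch; the path is just a placeholder), re-running
sa-learn over a folder that's already been trained as spam shows the skips
in its summary line, something like:

    sa-learn --spam /some/folder/of/spam/
    Learned tokens from 0 message(s) (500 message(s) examined)

The "examined" count includes the messages sa-learn skipped because
bayes_seen already records them as learned.
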
>> I never suggested that you should parse the headers. sa-learn does this
>> to extract the message-id and compare that to the bayes_seen database.
>> sa-learn *MUST* do this much to determine if the message has already
>> been learned. There's NO other way.
>
> even so, it should be possible to parse the message, extract the
> message-id, and compare the results in << 20 seconds. 
Yep. If you're feeding sa-learn a single message at a time, I can see how
you'd get the impression that it's really slow to make this decision. Most
of that time would be spent just loading sa-learn.

Try this experiment:

Set up a directory full of messages you want to train.

Time sa-learn on it, and feed it the WHOLE DIRECTORY at once. Don't iterate
over individual messages, don't specify filenames; just give sa-learn the
name of the directory.

time sa-learn --spam /some/folder/of/spam/

Then, afterwards, re-run it. The second time it should skip all the
messages and run substantially faster. If it doesn't, and the first pass
really did learn the messages, you've got a problem.

>
>> That's a "separate sorter". sa-learn already does this internally, so
>> *any* code on your part is a waste.
>
> if sa-learn already does this internally then it's doing it rather
> inefficiently.  20 seconds to pull a message id and compare it against
> the db (berkeleydb, fwiw)? 
I'd venture to guess sa-learn spends most of that time loading the Perl
interpreter and shutting it back down. That's why you really should avoid
feeding sa-learn single messages; it's really slow. It's also why
"spamassassin" is much slower than the spamc/spamd pair.

The other possibility is that you've got write-lock contention. You can
avoid a lot of this by using the bayes_learn_to_journal option, at the
expense of your training not taking effect until the next sync.
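For reference, a rough sketch of that setup (the local.cf location varies by
install; /etc/mail/spamassassin/local.cf is typical):

    # local.cf: write learned tokens to the bayes journal instead of
    # taking a write lock on the main bayes files for every learn
    bayes_learn_to_journal 1

The journal gets folded into the database at the next sync, which you can
force by hand with:

    sa-learn --sync
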


As a matter of fact, using BerkeleyDB, you should be able to LEARN 2,000
messages in 124 seconds on a P4.

Phases 1a and 1b both learn 2,000 fresh messages into the DB, run as a
single batch from an mbox file. However, these tests are purely testing
sa-learn; no live mail scanning is accessing the bayes DB at the time.

http://wiki.apache.org/spamassassin/BayesBenchmarkResults
