On Apr 21, 2007, at 11:23 AM, Matt Kettler wrote:
Ok, but how does knowing what SA learned it as help? It doesn't.

Figure out what to train as, and train.

it helps in that i can automatically iterate over some or all of my mail folders on a regular basis, selectively retraining *if*:

a) the message has already been trained
b) it's been trained the same way that i want it trained in the end
and
c) the cost of determining it's already been trained is substantially lower than the cost of just training it

right now i do this manually: i have a "retrain as spam" folder and a "retrain as ham" folder and i hit them each every 5 minutes. i'd rather get rid of the folders, which would let me use the client-side junk mail systems to flag messages as spam or ham; sa would then pick up those flags to retrain.

I never suggested that you should parse the headers. sa-learn does this
to extract the message-id and compare that to the bayes_seen database.
sa-learn *MUST* do this much to determine if the message has already
been learned. There's NO other way.

even so, it should be possible to parse the message, extract the message-id, and compare the results in << 20 seconds.
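
As a rough illustration of the check being described (not sa-learn's actual code), the sketch below parses only the headers of a message, pulls the Message-ID, and looks it up in a store of already-learned messages. The `seen` dict is a hypothetical stand-in for the bayes_seen database, whose real on-disk format is not shown in this thread, and the names `message_id` and `needs_training` are made up for the example.

```python
from email.parser import HeaderParser

# Hypothetical stand-in for bayes_seen: Message-ID -> how it was learned.
# The real database format is an assumption left out of this sketch.
seen = {"<abc123@example.com>": "spam"}


def message_id(raw):
    """Parse only the headers of a raw message and return its Message-ID, or None."""
    headers = HeaderParser().parsestr(raw, headersonly=True)
    value = headers.get("Message-ID")
    return value.strip() if value else None


def needs_training(raw, want, seen_db):
    """Return True unless the message was already learned the way we want."""
    mid = message_id(raw)
    if mid is None:
        # No Message-ID to look up, so we cannot prove it was learned.
        return True
    return seen_db.get(mid) != want


raw = "Message-ID: <abc123@example.com>\nSubject: hi\n\nbody\n"
print(needs_training(raw, "spam", seen))  # already learned as spam
print(needs_training(raw, "ham", seen))   # learned the other way
```

Header parsing plus one keyed lookup is cheap work per message, which is the basis for expecting the check to take far less than 20 seconds.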

That's a "separate sorter". sa-learn already does this internally, so *any* code on your part is a waste.

if sa-learn already does this internally then it's doing it rather inefficiently. 20 seconds to pull a message id and compare it against the db (berkeleydb, fwiw)?

-faisal
