On Apr 21, 2007, at 11:23 AM, Matt Kettler wrote:
> Ok, but how does knowing what SA learned it as help? It doesn't.
> Figure out what to train as, and train.
it helps in that i can automatically iterate over some or all of my
mail folders on a regular basis, selectively retraining *if*:
a) the message has already been trained
b) it's been trained the same way that i want it trained in the end
and
c) the cost of determining it's already been trained is substantially
lower than the cost of just training it
right now i do this manually: i have a "retrain as spam" folder and
a "retrain as ham" folder and i hit them each every 5 minutes. i'd
rather get rid of the folders, which lets me then use the client-side
junk mail systems to flag messages as spam or ham, which sa would
then pick up to retrain.
> I never suggested that you should parse the headers. sa-learn does this
> to extract the message-id and compare that to the bayes_seen database.
> sa-learn *MUST* do this much to determine if the message has already
> been learned. There's NO other way.
even so, it should be possible to parse the message, extract the
message-id, and compare the results in << 20 seconds.
> That's a "separate sorter". sa-learn already does this internally,
> so *any* code on your part is a waste.
if sa-learn already does this internally then it's doing it rather
inefficiently. 20 seconds to pull a message id and compare it
against the db (berkeleydb, fwiw)?
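
for illustration, here is a rough sketch of the check being discussed: parse only the headers, pull the message-id, and look it up in a table of what has already been trained. this is not sa-learn's code and does not read the real bayes_seen file; the `seen` mapping (message-id -> "spam" or "ham") and both function names are made up for the example, and a plain dict stands in for the on-disk db.

```python
from email.parser import BytesHeaderParser


def needs_training(raw_message, want, seen):
    """Return True if raw_message should be handed to the trainer.

    raw_message: the full message as bytes
    want:        "spam" or "ham", the way we want it trained
    seen:        hypothetical mapping of message-id -> "spam"/"ham"
                 (a stand-in for illustration, NOT the bayes_seen format)
    """
    # Parse the header block only; the body is never tokenized here.
    headers = BytesHeaderParser().parsebytes(raw_message)
    msgid = headers.get("Message-ID")
    if msgid is None:
        # Nothing to look up, so train it to be safe.
        return True
    # Skip only when it was already trained the way we want.
    return seen.get(str(msgid).strip()) != want


def select_for_training(raw_messages, want, seen):
    """Filter a folder's messages down to the ones worth training."""
    return [raw for raw in raw_messages if needs_training(raw, want, seen)]
```

the point of the sketch is only that the lookup touches the headers and one key in the table, so the per-message cost should be small next to a full training pass.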
-faisal