Faisal N Jawdat wrote:
> On Apr 21, 2007, at 11:23 AM, Matt Kettler wrote:
>> Ok, but how does knowing what SA learned it as help? It doesn't.
>>
>> Figure out what to train as, and train.
>
> it helps in that i can automatically iterate over some or all of my
> mail folders on a regular basis, selectively retraining *if*:
>
> a) the message has already been trained
> b) it's been trained the same way that i want it trained in the end
> and
> c) the cost of determining it's already been trained is substantially
> lower than the cost of just training it

But what's the point in that? Why not:
1) Tell sa-learn the way you want the messages trained.
2) Let sa-learn check its "bayes_seen" database to see which messages
were already learned properly; it will automatically skip them, saving
processing time by not relearning them.

sa-learn already skips messages that are correctly learned, completely
automatically. In fact, there's no way to make it *not* skip a message
that has already been properly learned. What does your proposal do that
sa-learn doesn't already do?

>> I never suggested that you should parse the headers. sa-learn does this
>> to extract the message-id and compare that to the bayes_seen database.
>> sa-learn *MUST* do this much to determine if the message has already
>> been learned. There's NO other way.
>
> even so, it should be possible to parse the message, extract the
> message-id, and compare the results in << 20 seconds.

Yep. If you're feeding sa-learn a single message at a time, I can see
how you'd get the impression that this decision is really slow. Most of
that time is spent loading sa-learn itself.

Try this experiment: set up a directory full of messages you want to
train, then time sa-learn on it, feeding it the WHOLE DIRECTORY at
once. Do not iterate over messages, do not specify filenames; just give
sa-learn the name of the directory:

time sa-learn --spam /some/folder/of/spam/

Then re-run it. The second time it should skip all the messages and run
substantially faster. If it doesn't, and the first pass did learn
messages, you've got a problem.

>
>> That's a "separate sorter". sa-learn already does this internally, so
>> *any* code on your part is a waste.
>
> if sa-learn already does this internally then it's doing it rather
> inefficiently. 20 seconds to pull a message id and compare it against
> the db (berkeleydb, fwiw)?

I'd venture to guess sa-learn spends most of that time loading the Perl
interpreter and shutting it back down.
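To make the point concrete: parsing a message and pulling out its Message-ID is a trivially cheap operation. Here's a minimal sketch in Python (not sa-learn's actual Perl internals, and the sample message is made up); it only demonstrates that the header parse itself takes a tiny fraction of a second, nowhere near the 20 seconds observed, which supports the idea that the time goes to interpreter startup:

```python
# Sketch: extract a Message-ID from a raw message, the kind of lookup key
# sa-learn's bayes_seen duplicate check conceptually needs. The parse is
# fast; per-message sa-learn runs are slow because of startup overhead.
import email
import time

RAW_MESSAGE = b"""\
Message-ID: <example.12345@mail.example.org>
From: sender@example.org
Subject: test message

message body
"""

def extract_message_id(raw: bytes) -> str:
    """Parse the message and return its Message-ID header."""
    return email.message_from_bytes(raw).get("Message-ID", "")

start = time.perf_counter()
msgid = extract_message_id(RAW_MESSAGE)
elapsed = time.perf_counter() - start

print(msgid)          # <example.12345@mail.example.org>
print(elapsed < 1.0)  # True: far below the per-message times discussed
```

This is why batching matters: the fixed startup cost is paid once per invocation, not once per message.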
That's why you really should avoid feeding single messages to sa-learn;
it's really slow that way. It's also why "spamassassin" is much slower
than the spamc/spamd pair.

The other possibility is that you've got write-lock contention. You can
avoid a lot of this by using the bayes_learn_to_journal option, at the
expense of your training not taking effect until the next sync.

As a matter of fact, using BerkeleyDB, you should be able to LEARN
2,000 messages in 124 seconds on a P4. Phases 1a and 1b in the
benchmark below both learn 2,000 fresh messages into the DB, run as a
single batch using an mbox file. Note, however, that these tests are
purely testing sa-learn; no live mail scanning was accessing the Bayes
DB at the time.

http://wiki.apache.org/spamassassin/BayesBenchmarkResults
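For reference, the journaling trade-off described above is a one-line config change. A sketch, assuming the usual local.cf location (adjust the path for your install):

```
# /etc/mail/spamassassin/local.cf
# Defer Bayes writes to the journal instead of locking the DB per learn.
bayes_learn_to_journal 1
```

The learned tokens then take effect only when the journal is folded back into the database, which you can force with `sa-learn --sync` (e.g. from cron).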