Faisal N Jawdat wrote:
> On Apr 21, 2007, at 1:30 AM, Matt Kettler wrote:
>>> 2. which way do i learn it.
>>
>> Erm, if it's spam, learn it as spam.. if it's nonspam, learn it as
>> nonspam. What's the problem here?
>
> i have a program looking through for untrained messages and deciding
> what to train them as. alternatively, i have a program looking
> through and training all messages in a folder, deciding how to train
> on the fly.

Ok, but how does knowing what SA learned it as help? It doesn't.
Figure out what to train as, and train.

>
>> What you want to do would reduce efficiency by making SA take two
>> passes. In the first pass, it parses all the headers of every
>> message, and tells you which ones it's learned or not.
>
> a couple issues here:
>
> 1. the headers do not necessarily tell the truth -- if you train on a
> message after it arrives then the headers will still say the same as
> written at delivery time. and, as you point out, parsing the headers
> is an ugly way to do it.

I never suggested that you should parse the headers. sa-learn does this
to extract the message-id and compare that to the bayes_seen database.
sa-learn *MUST* do this much to determine if the message has already
been learned. There's NO other way.

>
> 2. depending on how fast the "have i trained this message before"
> lookup is, this could still beat training every message. as it is i'm
> looking at 19-20 seconds to [not] retrain a previously trained
> message on a fairly unloaded box.
>
> i guess i could write a wrapper script around the sa-learn functions
> to keep a separate db of what has and hasn't been trained.

But *WHY*.. SpamAssassin already has such a database! And it uses it for
the exact same purpose you propose!

What you're ultimately trying to do is redundant. sa-learn already
handles all of this.

>
>> Then you use some external sorter
>> Then you call SA to learn the messages that weren't learned. It now has
>> to re-parse the headers from scratch, then parse/tokenize and learn the
>> body.
>
> why call a separate sorter? do something more like:
>
> for my $message (@messages) {
>     learn($message) unless (already_learned($message))
> }

That's a "separate sorter". sa-learn already does this internally, so
*any* code on your part is a waste.

Why are you wanting to redundantly implement a feature sa-learn already
does? How are you going to do it any "better"?
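
To make the point concrete, here is a rough Python sketch of the kind of check described above: pull the Message-ID out of the headers, look it up in a "seen" table, and skip the expensive work if it is already there. This is only a toy model, not SpamAssassin's actual code. The `seen` dict is a stand-in for the bayes_seen database, and the names `message_id` and `learn` are made up for illustration.

```python
from email import message_from_string

# Toy stand-in for the bayes_seen database: Message-ID -> "spam" or "ham".
# (Hypothetical structure, for illustration only.)
seen = {}

def message_id(raw):
    """Return the Message-ID header of a raw message, or None if absent."""
    return message_from_string(raw)["Message-ID"]

def learn(raw, label):
    """Record the message under `label` unless it is already recorded that way."""
    msgid = message_id(raw)
    if seen.get(msgid) == label:
        return "skipped"   # already learned this way; nothing more to do
    seen[msgid] = label    # a real learner would tokenize the body here
    return "learned"

raw = "Message-ID: <abc@example.com>\nSubject: hi\n\nbody text\n"
first = learn(raw, "spam")    # first time: gets recorded
second = learn(raw, "spam")   # second time: lookup hits, message is skipped
```

Note that even the "skipped" path has to read the headers to get the Message-ID, which is why a wrapper doing the same lookup beforehand just repeats that work.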