Faisal N Jawdat wrote:
> On Apr 16, 2007, at 9:34 PM, Matt Kettler wrote:
>> Try to learn it, if it comes back with something to the affect of:
>> "learned from 0 messages, processed 1.." then it's already been learned.
>

First, sorry for taking so long to get back to you.. I've been absurdly busy at work lately.

> this seems to be the common suggestion.
>
> it has a couple drawbacks, as i see it:
>
> 1. it's relatively cpu-intensive if i want to do it all the time
> (e.g. scan my spam folder to learn only the messages which haven't
> already been learned)

If this is your SOLE desire, yes, because sa-learn will at the same time also learn every message that's not been learned before.

>
> 2. which way do i learn it.

Erm, if it's spam, learn it as spam.. if it's nonspam, learn it as nonspam. What's the problem here?

>
> to step back a bit, my final goal is to be able to figure out which
> messages in a folder haven't been learned, and learn only those. in
> the ideal situation i can also figure out (ahead of time), whether a
> learned message was learned as ham or spam.

You do realize that the above idea would be *SLOWER* than feeding all the messages to sa-learn, right? sa-learn already internally recognizes which messages are already learned and skips them.

What you want to do would reduce efficiency by making SA take two passes. In the first pass, it parses the headers of every message and tells you which ones it has learned. Then you use some external sorter to pick out the unlearned ones. Then you call SA to learn the messages that weren't learned; it now has to re-parse the headers from scratch, then parse/tokenize and learn the body.

Your sorter and the re-parsing of the headers are excess overhead that would have been eliminated by letting sa-learn just handle it all the first time through.
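
If you still want numbers on how many messages in a folder were new, here is a rough sketch of the single-pass approach: hand the whole folder to sa-learn once and read its summary line. It is only a sketch under some assumptions: the summary format ("Learned tokens from N message(s) (M message(s) examined)") is what I'd expect to see and may differ between versions, and the helper names parse_summary and learn_folder are made up for illustration.

```python
import re
import subprocess

# Assumed format of sa-learn's summary line; adjust if your version differs.
SUMMARY_RE = re.compile(
    r"Learned tokens from (\d+) message\(s\) \((\d+) message\(s\) examined\)"
)


def parse_summary(output):
    """Return (newly_learned, examined) from sa-learn output, or None."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def learn_folder(path, kind="spam"):
    """Run sa-learn once over a folder (hypothetical helper).

    sa-learn skips messages it has already learned, so one pass is enough.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    result = subprocess.run(
        ["sa-learn", "--" + kind, path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_summary(result.stdout)
```

Calling learn_folder on your spam folder (whatever its path is) would return a pair such as (3, 120): 120 messages examined, 3 of them new. The difference is the number that had already been learned and were skipped.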