Faisal N Jawdat wrote:
> On Apr 16, 2007, at 9:34 PM, Matt Kettler wrote:
>> Try to learn it, if it comes back with something to the effect of:
>> "learned from 0 messages, processed 1.." then it's already been learned.
>
First, sorry for taking so long to get back to you.. I've been absurdly
busy at work lately.
> this seems to be the common suggestion.
>
> it has a couple drawbacks, as i see it:
>
> 1.  it's relatively cpu-intensive if i want to do it all the time
> (e.g. scan my spam folder to learn only the messages which haven't
> already been learned)
If finding the unlearned messages is your SOLE desire, yes, because
sa-learn will at the same time also learn every message that hasn't been
learned before.
>
> 2.  which way do i learn it.
Erm, if it's spam, learn it as spam.. if it's nonspam, learn it as
nonspam. What's the problem here?
>
> to step back a bit, my final goal is to be able to figure out which
> messages in a folder haven't been learned, and learn only those.  in
> the ideal situation i can also figure out (ahead of time), whether a
> learned message was learned as ham or spam.

You do realize that the above idea would be *SLOWER* than feeding all
the messages to sa-learn.. Right?

sa-learn already internally recognizes which messages are already
learned and skips them.

What you want to do would reduce efficiency by making SA take two
passes. In the first pass, it parses all the headers of every message,
and tells you which ones it's learned or not.
Then you use some external sorter.
Then you call SA to learn the messages that weren't learned. It now has
to re-parse the headers from scratch, then parse/tokenize and learn the
body.

Your sorter and the re-parsing of the headers are excess overhead that
would have been eliminated by letting sa-learn just handle it all the
first time through.
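
For what it's worth, here's a rough sketch of the single-pass approach: hand the whole folder to sa-learn once and read its summary to see how many messages were actually new. It's in Python purely for illustration. The helper names (build_command, parse_summary, learn_folder) are made up, and the summary line format in the regex is an assumption based on output I've seen, so check it against your version. It also assumes a file or maildir-style directory; an mbox file would need --mbox added.

```python
import re
import subprocess

# Assumed shape of sa-learn's summary line; verify against your version.
SUMMARY_RE = re.compile(
    r"Learned tokens from (\d+) message\(s\) \((\d+) message\(s\) examined\)"
)


def build_command(kind, folder):
    """Build the sa-learn command line for a folder of spam or ham."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, folder]


def parse_summary(output):
    """Return (learned, examined) from sa-learn output, or None if absent."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def learn_folder(kind, folder):
    """One pass: sa-learn skips messages it has already learned by itself."""
    result = subprocess.run(
        build_command(kind, folder),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_summary(result.stdout)
```

Calling learn_folder("spam", "/path/to/spam-folder") would then give back a (learned, examined) pair; the difference between the two numbers is how many messages were skipped as already learned.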



