On Apr 21, 2007, at 1:30 AM, Matt Kettler wrote:
>> 2. which way do i learn it.
> Erm, if it's spam, learn it as spam.. if it's nonspam, learn it as
> nonspam. What's the problem here?
i have a program looking through for untrained messages and deciding
what to train them as. alternatively, i have a program looking
through and training all messages in a folder, deciding how to train
on the fly.
> What you want to do would reduce efficiency by making SA take two
> passes. In the first pass, it parses all the headers of every
> message, and tells you which ones it's learned or not.
a couple issues here:
1. the headers do not necessarily tell the truth -- if you train on
a message after it arrives then the headers will still say the same
as written at delivery time. and, as you point out, parsing the
headers is an ugly way to do it.
2. depending on how fast the "have i trained this message before"
lookup is, this could still beat training every message. as it is
i'm looking at 19-20 seconds to [not] retrain a previously trained
message on a fairly unloaded box.
i guess i could write a wrapper script around the sa-learn
functions to keep a separate db of what has and hasn't been trained.
> Then you use some external sorter
> Then you call SA to learn the messages that weren't learned. It now has
> to re-parse the headers from scratch, then parse/tokenize and learn the
> body.
why call a separate sorter? do something more like:
for my $message (@messages) {
    learn($message) unless already_learned($message);
}
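
a rough sketch of that wrapper idea, written in python rather than perl so it's easy to try standalone. everything here is an assumption for illustration: the side db keyed on a hash of the raw message, the names message_key / sa_learn / train_folder, and the db path. the only real thing it leans on is sa-learn's --spam and --ham switches. the point is just that the "have i trained this before" check is a cheap db lookup that happens before sa-learn is ever started.

```python
import dbm
import hashlib
import subprocess
from pathlib import Path


def message_key(path):
    """Key a message by a hash of its raw bytes, not by its headers."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest().encode()


def sa_learn(path, kind):
    """Hand one message file to sa-learn; kind is 'spam' or 'ham'."""
    subprocess.run(["sa-learn", "--" + kind, str(path)], check=True)


def train_folder(paths, kind, db_path, learn=sa_learn):
    """Learn each message unless the side db says it was already
    learned as the same kind. Returns the paths actually learned."""
    learned = []
    with dbm.open(db_path, "c") as seen:  # hypothetical side db, created if missing
        for path in paths:
            key = message_key(path)
            if seen.get(key) == kind.encode():
                continue  # already learned this way: skip without running sa-learn
            learn(path, kind)
            seen[key] = kind.encode()  # record only after learning succeeded
            learned.append(path)
    return learned
```

usage would be something like train_folder(list_of_message_files, "spam", "trained.db"), with both arguments being whatever fits the local setup. a message that was recorded as spam and later shows up in a ham folder is passed through again, since the stored kind no longer matches.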
-faisal