Re: sa-learn: have i seen this before?

Faisal N Jawdat Sat, 21 Apr 2007 01:29:47 -0700

On Apr 21, 2007, at 1:30 AM, Matt Kettler wrote:

2.  which way do i learn it.
Erm, if it's spam, learn it as spam.. if it's nonspam, learn it as nonspam. What's the problem here?

i have a program looking through for untrained messages and deciding what to train them as. alternatively, i have a program looking through and training all messages in a folder, deciding how to train on the fly.

What you want to do would reduce efficiency by making SA take two passes. In the first pass, it parses all the headers of every message, and tells you which ones it's learned or not.


a couple issues here:

1. the headers do not necessarily tell the truth -- if you train on a message after it arrives then the headers will still say the same as written at delivery time. and, as you point out, parsing the headers is an ugly way to do it.

2. depending on how fast the "have i trained this message before" lookup is, this could still beat training every message. as it is i'm looking at 19-20 seconds to [not] retrain a previously trained message on a fairly unloaded box.

i guess i could write a wrapper script around the sa-learn functions to keep a separate db of what has and hasn't been trained.

Then you use some external sorter
Then you call SA to learn the messages that weren't learned. It now has to re-parse the headers from scratch, then parse/tokenize and learn the body.


why call a separate sorter?  do something more like:

for my $message (@messages) {
  learn($message) unless (already_learned($message))
}
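
to make the loop above concrete, here is a rough sketch of the wrapper idea, written in python rather than perl. everything in it is an assumption for illustration: the names (open_db, already_learned, learn_once), the sqlite table, and the choice to identify a message by a sha-256 digest of its raw bytes are all made up here and are not part of sa-learn. the only real thing it leans on is the sa-learn command line with its --spam and --ham options.

```python
import hashlib
import sqlite3
import subprocess


def message_key(raw: bytes) -> str:
    """Identify a message by a digest of its raw bytes (an assumption;
    the Message-ID header would be another possible key)."""
    return hashlib.sha256(raw).hexdigest()


def open_db(path=":memory:"):
    """Open (or create) the hypothetical side database of trained messages."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS trained "
        "(key TEXT PRIMARY KEY, kind TEXT NOT NULL)"
    )
    return db


def already_learned(db, key, kind):
    """True if this message was last trained as the same kind."""
    row = db.execute(
        "SELECT kind FROM trained WHERE key = ?", (key,)
    ).fetchone()
    return row is not None and row[0] == kind


def sa_learn(path, kind):
    """Run sa-learn on one message file; kind is 'spam' or 'ham'."""
    subprocess.run(["sa-learn", "--" + kind, path], check=True)


def learn_once(db, path, kind, learner=sa_learn):
    """Train a message unless the side db says it was already trained
    as this kind. Returns True if the learner was actually called."""
    with open(path, "rb") as fh:
        key = message_key(fh.read())
    if already_learned(db, key, kind):
        return False
    learner(path, kind)
    # Record (or update) what this message was last trained as.
    db.execute(
        "INSERT OR REPLACE INTO trained (key, kind) VALUES (?, ?)",
        (key, kind),
    )
    db.commit()
    return True
```

the lookup is a single indexed query, so whether this beats letting sa-learn decide for itself depends on how the cost of hashing the file compares with the 19-20 seconds mentioned above. the learner argument is only there so the bookkeeping can be tried without sa-learn installed.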

-faisal
