Faisal N Jawdat wrote:
> On Apr 21, 2007, at 1:30 AM, Matt Kettler wrote:
>>> 2.  which way do i learn it.
>>
>> Erm, if it's spam, learn it as spam.. if it's nonspam, learn it as
>> nonspam. What's the problem here?
>
> i have a program looking through for untrained messages and deciding
> what to train them as.  alternatively, i have a program looking
> through and training all messages in a folder, deciding how to train
> on the fly.
Ok, but how does knowing what SA learned it as help? It doesn't.

Figure out what to train as, and train.

>
>> What you want to do would reduce efficiency by making SA take two
>> passes. In the first pass, it parses all the headers of every
>> message, and tells you which ones it's learned or not.
>
> a couple issues here:
>
> 1.  the headers do not necessarily tell the truth -- if you train on a
> message after it arrives then the headers will still say the same as
> written at delivery time.  and, as you point out, parsing the headers
> is an ugly way to do it.
I never suggested that you should parse the headers. sa-learn does this
to extract the message-id and compare that to the bayes_seen database.
sa-learn *MUST* do this much to determine if the message has already
been learned. There's NO other way.
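
To make that concrete, here's a rough sketch of the idea in Perl. It is
not sa-learn's actual code: %seen is just an in-memory stand-in for the
bayes_seen database, and message_id() and learn() are names I made up
for the illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for bayes_seen: message-id => 's' (spam) or 'h' (ham).
my %seen;

# Pull the Message-ID out of the header block, i.e. everything before
# the first blank line.
sub message_id {
    my ($raw) = @_;
    my ($headers) = split /\r?\n\r?\n/, $raw, 2;
    $headers = '' unless defined $headers;
    return $headers =~ /^Message-ID:\s*(\S+)/mi ? $1 : undef;
}

# Learn a message unless it is already recorded under the same class.
# Returns 1 if training happened, 0 if the message was skipped.
sub learn {
    my ($raw, $class) = @_;
    my $id = message_id($raw);
    return 0 unless defined $id;    # sketch only: ignore messages with no id
    return 0 if defined $seen{$id} && $seen{$id} eq $class;
    # ... tokenizing the body and updating token counts would go here ...
    $seen{$id} = $class;
    return 1;
}

my $msg = "Message-ID: <abc\@example.com>\nSubject: hi\n\nbody text\n";
print learn($msg, 's'), "\n";    # prints 1: first pass, learned
print learn($msg, 's'), "\n";    # prints 0: second pass, skipped
```

The point is that the "already learned?" check can't happen until the
headers have been read, so asking first and learning second reads every
message twice.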
>
> 2.  depending on how fast the "have i trained this message before"
> lookup is, this could still beat training every message.  as it is i'm
> looking at 19-20 seconds to [not] retrain a previously trained
> message on a fairly unloaded box.
>
> i guess i could write a wrapper script around the sa-learn functions
> to keep a separate db of what has and hasn't been trained.
But *WHY*? SpamAssassin already has such a database! And it uses it for
the exact same purpose you propose!

What you're ultimately trying to do is redundant. sa-learn already
handles all of this.
>
>> Then you use some external sorter
>> Then you call SA to learn the messages that weren't learned. It now has
>> to re-parse the headers from scratch, then parse/tokenize and learn the
>> body.
>
> why call a separate sorter?  do something more like:
>
> for my $message (@messages) {
>   learn($message) unless (already_learned($message))
> }
That's a "separate sorter". sa-learn already does this internally, so
*any* code on your part is a waste.

Why do you want to redundantly implement a feature sa-learn already
has?
How are you going to do it any "better"?
