Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

John Hardin Wed, 06 Feb 2013 09:36:49 -0800

On Wed, 6 Feb 2013, Ben Johnson wrote:

On 2/1/2013 7:58 PM, John Hardin wrote:


That latter brings up another concern for the vetted-corpora model: if a
message is *removed* from a training corpus mailbox rather than
reclassified, you'd have to wipe and retrain your database from scratch
to remove that message's effects.

So, you need *three* vetted corpus mailboxes: spam, ham, and
should-not-have-been-trained (forget). Rather than deleting a message
from the ham or spam corpus mailbox you move it to the forget mailbox
and in the next training pass sa-learn forgets the message and removes
it from the forget mailbox. This would be some special scripting,
because you can't just "sa-learn --forget" a whole mailbox.

There would also need to be an audit process to detect whether the same
message_id is in both the ham and spam corpus mailboxes, so that the
admin can delete (NOT forget) the incorrect classification, or forget
the message if neither classification is reasonable.
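
The audit described above could be sketched roughly as follows. This is only an illustration, assuming the ham and spam corpora are stored as mbox files readable by Python's standard mailbox module; the function names and file paths are hypothetical, and a Maildir or IMAP setup would need a different reader.

```python
import mailbox


def message_ids(path):
    """Return the set of Message-ID header values found in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        ids = set()
        for msg in box:
            mid = msg["Message-ID"]
            if mid:  # skip messages that have no Message-ID header
                ids.add(str(mid).strip())
        return ids
    finally:
        box.close()


def find_conflicts(ham_path, spam_path):
    """Return Message-IDs that appear in both the ham and spam corpora."""
    return message_ids(ham_path) & message_ids(spam_path)


if __name__ == "__main__":
    # Hypothetical corpus locations; adjust to the local layout.
    for mid in sorted(find_conflicts("corpus/ham.mbox", "corpus/spam.mbox")):
        print("classified as both ham and spam:", mid)
```

The script only reports conflicts. Deciding which copy to delete, or whether to move one to the forget mailbox, is left to the admin.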


You reveal some crucial information with regard to corpora management
here, John.

I've taken your good advice and created a third mailbox (well, a third
"folder" within the same mailbox), named "Forget".

It sounds as though the key here is never to delete messages from either
corpus -- unless the same message exists in both corpora, in which case
the misclassified message should be deleted. If neither classification
is reasonable and the message should instead be forgotten, what's the
order of operations? Should a copy of the message be created in the
"Forget" corpus and then the message deleted from both the "Ham" and
"Spam" corpora?


I would suggest: *move* one to the Forget folder and delete the other.

I am assuming that learning from the vetted corpora folders is on a schedule rather than in real-time, so that you have a liberal window for completing these operations.

With regard to the specialized scripting required to "forget" messages,
this sounds cumbersome


Yeah.

because you can't just "sa-learn --forget" a whole mailbox.


Is there a non-obvious reason for this? Would the logic behind a
recursive --forget switch not be the same or similar as with the
existing --ham and --spam switches?

Oh, the logic would be the same, it's just not implemented. That's why you can't do it. :)
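
A rough sketch of the special scripting mentioned earlier might look like the following. It assumes the forget mailbox is an mbox file and that sa-learn reads a single message from standard input when no file argument is given; both are assumptions, and the function names and path are hypothetical. Each message is passed to "sa-learn --forget" individually and is removed from the mailbox only if that call succeeds.

```python
import mailbox
import subprocess


def run_sa_learn_forget(raw_message):
    """Pipe one raw message (bytes) to `sa-learn --forget`; True on success.

    Assumption: sa-learn reads the message from stdin when no file is given.
    """
    result = subprocess.run(
        ["sa-learn", "--forget"],
        input=raw_message,
        stdout=subprocess.DEVNULL,
    )
    return result.returncode == 0


def forget_pass(forget_path, forget_one=run_sa_learn_forget):
    """Forget every message in the mbox, removing those handled successfully.

    Returns the number of messages forgotten and removed.
    """
    box = mailbox.mbox(forget_path, create=False)
    forgotten = 0
    try:
        # Snapshot the keys so messages can be removed while looping.
        for key in list(box.keys()):
            if forget_one(box.get_bytes(key)):
                box.remove(key)
                forgotten += 1
        box.flush()  # write the removals back to disk
    finally:
        box.close()
    return forgotten


if __name__ == "__main__":
    # Hypothetical location of the forget mailbox.
    print(forget_pass("corpus/forget.mbox"), "message(s) forgotten")
```

Messages for which sa-learn fails stay in the forget mailbox, so the next scheduled pass will retry them.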

Finally, when a user submits a message to be classified as ham or spam,
how should I be sorting the messages? I see the following scenarios:

1.) I agree with the end-user's classification.

2.) I disagree with the end-user's classification.
        a.) Because the message was submitted as ham but is really spam (or
vice versa)
        b.) Because neither classification is reasonable

In case 1.), should I *copy* the message from the submission inbox's Ham
folder to the permanent Ham corpus folder? Or should I *move* the
message? I'm trying to discern whether or not there's value in retaining
end-user submissions *as they were classified upon submission*.


I don't see any value to retaining them in the public submission folders.

In fact, you may want to make the ham submission folder write-only (if that's possible) in order to help preserve your individual users' privacy.

In case 2.), should I simply delete the message from the submission
folder? Or is there some reason to retain the message (i.e., move it
into an "Erroneous" folder within the submission mailbox)?

You might want to do that if you intend to approach the user and train them about why it wasn't a correct submission and you want evidence - for example, to say that this looks like a message from a legitimate mailing list that they intentionally subscribed to at some point, and the unsubscribe link is right there (points at screen).

Apart from that, I don't see a reason to keep erroneous submissions either.

I did read http://wiki.apache.org/spamassassin/HandClassifiedCorpora ,
but it doesn't address these issues, specifically.

Yeah, that assumes familiarity with these issues, and managing masscheck corpora is a slightly different task than managing user-fed Bayes training corpora.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  ...we talk about creating "millions of shovel-ready jobs" for a
  society that doesn't really encourage people to pick up a shovel.
                             -- Mike Rowe, testifying before Congress
-----------------------------------------------------------------------
 6 days until Abraham Lincoln's and Charles Darwin's 204th Birthdays
