On Wed, 6 Feb 2013, Ben Johnson wrote:
On 2/1/2013 7:58 PM, John Hardin wrote:
That latter brings up another concern for the vetted-corpora model: if a
message is *removed* from a training corpus mailbox rather than
reclassified, you'd have to wipe and retrain your database from scratch
to remove that message's effects.
So, you need *three* vetted corpus mailboxes: spam, ham, and
should-not-have-been-trained (forget). Rather than deleting a message
from the ham or spam corpus mailbox you move it to the forget mailbox
and in the next training pass sa-learn forgets the message and removes
it from the forget mailbox. This would be some special scripting,
because you can't just "sa-learn --forget" a whole mailbox.
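As a rough illustration of the "special scripting" described above, here is a minimal sketch, assuming the forget mailbox is a Maildir folder and that "sa-learn --forget <file>" exits with status 0 on success. The function name forget_folder and the folder path are hypothetical, not part of SpamAssassin.

```python
import subprocess
from pathlib import Path


def forget_folder(maildir, run=subprocess.run):
    """Run `sa-learn --forget` on each message file in a Maildir folder.

    A message is removed from the folder only if sa-learn reports
    success (assumed here to mean exit status 0). Returns the names
    of the message files that were forgotten and removed.
    """
    maildir = Path(maildir)
    forgotten = []
    # Maildir keeps delivered messages in the "new" and "cur" subdirectories.
    for sub in ("new", "cur"):
        folder = maildir / sub
        if not folder.is_dir():
            continue
        for msg_file in sorted(folder.iterdir()):
            if not msg_file.is_file():
                continue
            # One sa-learn invocation per message file.
            result = run(["sa-learn", "--forget", str(msg_file)])
            if result.returncode == 0:
                msg_file.unlink()
                forgotten.append(msg_file.name)
    return forgotten


if __name__ == "__main__":
    # Hypothetical location of the Forget folder; adjust for your setup.
    for name in forget_folder("/var/mail/corpus/.Forget"):
        print("forgot", name)
```

The run parameter exists only so the loop can be exercised without a real sa-learn installed; in normal use the default subprocess.run is what you want.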
There would also need to be an audit process to detect whether the same
Message-ID is in both the ham and spam corpus mailboxes, so that the
admin can delete (NOT forget) the incorrect classification, or forget
the message if neither classification is reasonable.
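The audit described above could be sketched roughly as follows, assuming both corpora are Maildir folders and that the Message-ID header is a good enough key for "the same message". The function names and the paths are hypothetical; the script only reports conflicts and leaves the delete-or-forget decision to the admin.

```python
import mailbox


def message_ids(maildir_path):
    """Collect the Message-ID header values found in a Maildir folder."""
    ids = set()
    box = mailbox.Maildir(maildir_path, create=False)
    try:
        for msg in box:
            mid = msg.get("Message-ID")
            if mid:
                ids.add(str(mid).strip())
    finally:
        box.close()
    return ids


def conflicting_ids(ham_path, spam_path):
    """Return the Message-IDs present in both the ham and spam corpora."""
    return message_ids(ham_path) & message_ids(spam_path)


if __name__ == "__main__":
    # Hypothetical corpus locations; adjust for your setup.
    for mid in sorted(conflicting_ids("/var/mail/corpus/.Ham",
                                      "/var/mail/corpus/.Spam")):
        print("in both corpora:", mid)
```

Messages with no Message-ID header are skipped by this sketch, so they would need a separate check (for example, a hash of the body) if that matters for your corpus.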
You reveal some crucial information with regard to corpora management
here, John.
I've taken your good advice and created a third mailbox (well, a third
"folder" within the same mailbox), named "Forget".
It sounds as though the key here is never to delete messages from either
corpus -- unless the same message exists in both corpora, in which case
the misclassified message should be deleted. If neither classification
is reasonable and the message should instead be forgotten, what's the
order of operations? Should a copy of the message be created in the
"Forget" corpus and then the message deleted from both the "Ham" and
"Spam" corpora?
I would suggest: *move* one to the Forget folder and delete the other.
I am assuming that learning from the vetted corpora folders is on a
schedule rather than in real-time, so that you have a liberal window for
completing these operations.
With regard to the specialized scripting required to "forget" messages,
this sounds cumbersome
Yeah.
because you can't just "sa-learn --forget" a whole mailbox.
Is there a non-obvious reason for this? Would the logic behind a
recursive --forget switch not be the same as, or similar to, that of
the existing --ham and --spam switches?
Oh, the logic would be the same, it's just not implemented. That's why you
can't do it. :)
Finally, when a user submits a message to be classified as ham or spam,
how should I be sorting the messages? I see the following scenarios:
1.) I agree with the end-user's classification.
2.) I disagree with the end-user's classification.
    a.) Because the message was submitted as ham but is really spam (or
        vice versa)
    b.) Because neither classification is reasonable
In case 1.), should I *copy* the message from the submission inbox's Ham
folder to the permanent Ham corpus folder? Or should I *move* the
message? I'm trying to discern whether or not there's value in retaining
end-user submissions *as they were classified upon submission*.
I don't see any value to retaining them in the public submission folders.
In fact, you may want to make the ham submission folder write-only (if
that's possible) in order to help preserve your individual users' privacy.
In case 2.), should I simply delete the message from the submission
folder? Or is there some reason to retain the message (i.e., move it
into an "Erroneous" folder within the submission mailbox)?
You might want to do that if you intend to approach the user and train
them about why it wasn't a correct submission and you want evidence - for
example, to say that this looks like a message from a legitimate mailing
list that they intentionally subscribed to at some point, and the
unsubscribe link is right there (points at screen).
Apart from that, I don't see a reason to keep erroneous submissions
either.
I did read http://wiki.apache.org/spamassassin/HandClassifiedCorpora ,
but it doesn't address these issues, specifically.
Yeah, that assumes familiarity with these issues, and managing masscheck
corpora is a slightly different task than managing user-fed Bayes training
corpora.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
...we talk about creating "millions of shovel-ready jobs" for a
society that doesn't really encourage people to pick up a shovel.
-- Mike Rowe, testifying before Congress
-----------------------------------------------------------------------
6 days until Abraham Lincoln's and Charles Darwin's 204th Birthdays